BEFORE YOU START

This is the second volume of ULTRIX-32 Supplementary Documents, a three-volume set
that contains articles describing the ULTRIX-32 system. The authors are computer scientists
and program developers at Bell Laboratories and the University of California at Berkeley.
The articles explain the software tools and utilities available on your ULTRIX-32
system. They constitute most of the lore that enriches this operating system; topics range
from getting started to the details of screen updating and cursor movement facilities.

Each volume in this set contains several parts, and each part begins with an introduction.
Each introduction serves as a map that will help you find your way around in the documentation,
allowing you to select articles that relate to your interest. Each introduction gives an
overview of the material covered in the part and a description of the articles included. Most
readers will not need to read all articles in any part, since many articles cover parallel topics.

These articles provide authoritative and accurate information that is unavailable elsewhere.
However, you should be aware that some of the information in some articles is dated. We
include those articles because many of the concepts they develop are still current and important.

At the end of each volume in this set, you will find a master index identifying topics and 
related pages in the text for all three volumes. 


Topics in Volume II 

The articles in this second volume deal with programming and support tools for programmers
on the ULTRIX-32 system. Most of the authors assume that readers are familiar with one
or more programming languages.

“UNIX Programming — Second Edition,” in Part 1 of this volume, tells how to write programs
that cooperate with the operating system. Many readers will find it useful to read this
article before going on to articles on the languages and utilities.

The articles in Part 2 deal with the C language and the M4 preprocessor. 






Part 3, Supporting Tools, offers articles on three kinds of utilities: 

• Program and library maintenance tools 

• Program checking and debugging tools 

• Compiler and preprocessor development tools 

And the articles in Part 4, System Programming, cover exotic topics such as: 

• Inner workings of the ULTRIX-32 system 

• System and kernel facilities available to user programs 

• Assembly language (as) 

• Screen manipulation functions 

• The ULTRIX-32 line printer spooler

The features described in this volume provide the flexibility and programming power for
which the UNIX system is famous. A good understanding of many of the concepts and procedures
presented here is essential for efficient use of your ULTRIX-32 system.





PART 1: PROGRAMMING CONSIDERATIONS


This part contains one article, “UNIX Programming — Second Edition,” by
Kernighan and Ritchie. The article gives background information that will help you write
programs that make full use of the ULTRIX-32 system. Readers should be familiar with
the fundamentals of the ULTRIX-32 system (or the UNIX system). Although the techniques
shown in the article apply to programming in any language available on the
ULTRIX-32 system, the sample programs are written in the C language.

The authors explain how to: 

• Pass arguments to and from a program 

• Send program output to a file, to a pipe, or to a terminal 

• Use the standard I/O (input/output) library 

• Handle I/O errors 

• Use low level I/O 

• Execute a program from within another 

• Handle signals (interrupts) 
UNIX Programming — Second Edition

Brian W. Kernighan 
Dennis M. Ritchie 


ABSTRACT 

This paper is an introduction to programming on the UNIX† system.
The emphasis is on how to write programs that interface to the operating
system, either directly or through the standard I/O library. The topics
discussed include

• handling command arguments 

• rudimentary I/O; the standard input and output 

• the standard I/O library; file system access

• low-level I/O: open, read, write, close, seek 

• processes: exec, fork, pipes 

• signals — interrupts, etc. 

There is also an appendix which describes the standard I/O library in 
detail. 


September 7, 1984 


† UNIX is a trademark of Bell Laboratories.








1. INTRODUCTION 

This paper describes how to write programs that interface with the UNIX operating
system in a non-trivial way. This includes programs that use files by name, that use pipes,
that invoke other commands as they run, or that attempt to catch interrupts and other
signals during execution.

The document collects material which is scattered throughout several sections of The 
UNIX Programmer’s Manual [1] for Version 7 UNIX. There is no attempt to be complete; 
only generally useful material is dealt with. It is assumed that you will be programming in 
C, so you must be able to read the language roughly up to the level of The C Programming 
Language [2]. Some of the material in sections 2 through 4 is based on topics covered 
more carefully there. You should also be familiar with UNIX itself at least to the level of 
UNIX for Beginners [3]. 

2. BASICS 

2.1. Program Arguments 

When a C program is run as a command, the arguments on the command line are made
available to the function main as an argument count argc and an array argv of pointers
to character strings that contain the arguments. By convention, argv[0] is the command
name itself, so argc is always greater than 0.

The following program illustrates the mechanism: it simply echoes its arguments back 
to the terminal. (This is essentially the echo command.) 

    main(argc, argv)    /* echo arguments */
    int argc;
    char *argv[];
    {
        int i;

        for (i = 1; i < argc; i++)
            printf("%s%c", argv[i], (i<argc-1) ? ' ' : '\n');
    }

argv is a pointer to an array whose individual elements are pointers to arrays of characters;
each is terminated by \0, so they can be treated as strings. The program starts by
printing argv[1] and loops until it has printed them all.

The argument count and the arguments are parameters to main. If you want to keep 
them around so other routines can get at them, you must copy them to external variables. 
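
As a minimal sketch of that last point, the fragment below copies the argument count and vector into external variables so that another routine can reach them. The names saved_argc, saved_argv, save_args and nth_arg are illustrative only and belong to no library, and the code uses ANSI-style prototypes rather than the older declarations shown above.

```c
#include <stddef.h>

/* External copies of main's parameters (hypothetical names). */
int saved_argc;
char **saved_argv;

/* Call once at the top of main to keep the arguments around. */
void save_args(int argc, char **argv)
{
    saved_argc = argc;
    saved_argv = argv;
}

/* Any routine can now fetch argument n; returns NULL if n is out of range. */
char *nth_arg(int n)
{
    if (n < 0 || n >= saved_argc)
        return NULL;
    return saved_argv[n];
}
```

A program would call save_args(argc, argv) as the first statement of main; after that, nth_arg(0) yields the command name from any function.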

2.2. The “Standard Input” and “Standard Output” 

The simplest input mechanism is to read the “standard input,” which is generally the
user’s terminal. The function getchar returns the next input character each time it is
called. A file may be substituted for the terminal by using the < convention: if prog uses
getchar, then the command line

    prog < file

causes prog to read file instead of the terminal. prog itself need know nothing about
where its input is coming from. This is also true if the input comes from another program
via the pipe mechanism:

otherprog | prog 

provides the standard input for prog from the standard output of otherprog. 

getchar returns the value EOF when it encounters the end of file (or an error) on
whatever you are reading. The value of EOF is normally defined to be -1, but it is unwise
to take any advantage of that knowledge. As will become clear shortly, this value is
automatically defined for you when you compile a program, and need not be of any concern.

Similarly, putchar(c) puts the character c on the “standard output,” which is also by 
default the terminal. The output can be captured on a file by using >: if prog uses 
putchar, 

prog >outfile 

writes the standard output on outfile instead of the terminal. outfile is created if it
doesn’t exist; if it already exists, its previous contents are overwritten. And a pipe can be
used:

prog | otherprog 

puts the standard output of prog into the standard input of otherprog. 

The function printf, which formats output in various ways, uses the same mechanism 
as putchar does, so calls to printf and putchar may be intermixed in any order; the 
output will appear in the order of the calls. 

Similarly, the function scanf provides for formatted input conversion; it will read the
standard input and break it up into strings, numbers, etc., as desired. scanf uses the
same mechanism as getchar, so calls to them may also be intermixed.

Many programs read only one input and write one output; for such programs I/O with 
getchar, putchar, scanf, and printf may be entirely adequate, and it is almost always 
enough to get started. This is particularly true if the UNIX pipe facility is used to connect 
the output of one program to the input of the next. For example, the following program 
strips out all ASCII control characters from its input (except for newline and tab).

    #include <stdio.h>

    main()    /* ccstrip: strip non-graphic characters */
    {
        int c;

        while ((c = getchar()) != EOF)
            if ((c >= ' ' && c < 0177) || c == '\t' || c == '\n')
                putchar(c);
        exit(0);
    }

The line 

    #include <stdio.h>

should appear at the beginning of each source file. It causes the C compiler to read a file 
(/usr/include/stdio.h) of standard routines and symbols that includes the definition of EOF. 

If it is necessary to treat multiple files, you can use cat to collect the files for you: 

    cat file1 file2 ... | ccstrip >output

and thus avoid learning how to access files from a program. By the way, the call to exit
at the end is not necessary to make the program work properly, but it assures that any
caller of the program will see a normal termination status (conventionally 0) from the program
when it completes. Section 6 discusses status returns in more detail.

3. THE STANDARD I/O LIBRARY 

The “Standard I/O Library” is a collection of routines intended to provide efficient and 
portable I/O services for most C programs. The standard I/O library is available on each 
system that supports C, so programs that confine their system interactions to its facilities 
can be transported from one system to another essentially without change. 

In this section, we will discuss the basics of the standard I/O library. The appendix 
contains a more complete description of its capabilities. 

3.1. File Access 

The programs written so far have all read the standard input and written the standard
output, which we have assumed are magically pre-defined. The next step is to write a program
that accesses a file that is not already connected to the program. One simple example
is wc, which counts the lines, words and characters in a set of files. For instance, the
command


wc x.c y.c 

prints the number of lines, words and characters in x.c and y.c and the totals.

The question is how to arrange for the named files to be read, that is, how to connect
the file system names to the I/O statements which actually read the data.

The rules are simple. Before it can be read or written a file has to be opened by the 
standard library function fopen. fopen takes an external name (like x.c or y.c), does 
some housekeeping and negotiation with the operating system, and returns an internal 
name which must be used in subsequent reads or writes of the file. 

This internal name is actually a pointer, called a file pointer, to a structure which contains
information about the file, such as the location of a buffer, the current character
position in the buffer, whether the file is being read or written, and the like. Users don’t
need to know the details, because part of the standard I/O definitions obtained by including
stdio.h is a structure definition called FILE. The only declaration needed for a file
pointer is exemplified by

    FILE *fp, *fopen();

This says that fp is a pointer to a FILE, and fopen returns a pointer to a FILE. (FILE is
a type name, like int, not a structure tag.)

The actual call to fopen in a program is

    fp = fopen(name, mode);

The first argument of fopen is the name of the file, as a character string. The second
argument is the mode, also as a character string, which indicates how you intend to use the
file. The only allowable modes are read ("r"), write ("w"), or append ("a").

If a file that you open for writing or appending does not exist, it is created (if possible).
Opening an existing file for writing causes the old contents to be discarded. Trying to read
a file that does not exist is an error, and there may be other causes of error as well (like
trying to read a file when you don’t have permission). If there is any error, fopen will
return the null pointer value NULL (which is defined as zero in stdio.h).

The next thing needed is a way to read or write the file once it is open. There are
several possibilities, of which getc and putc are the simplest. getc returns the next
character from a file; it needs the file pointer to tell it what file. Thus

    c = getc(fp)

places in c the next character from the file referred to by fp; it returns EOF when it
reaches end of file. putc is the inverse of getc:

putc(c, fp) 

puts the character c on the file fp and returns c. getc and putc return EOF on error. 
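
The pieces described so far (fopen, the NULL check, getc and putc) can be combined in a short sketch that copies one named file to another. The function name copy_file is hypothetical, the prototypes are ANSI-style, and fclose is used here simply to release each file pointer when the copy is done.

```c
#include <stdio.h>

/* Copy the file named from to the file named to, one character at a time.
   Returns 0 on success, -1 if either file cannot be opened. */
int copy_file(const char *from, const char *to)
{
    FILE *in, *out;
    int c;

    if ((in = fopen(from, "r")) == NULL)
        return -1;              /* missing file, no permission, etc. */
    if ((out = fopen(to, "w")) == NULL) {
        fclose(in);
        return -1;
    }
    while ((c = getc(in)) != EOF)
        putc(c, out);
    fclose(in);
    fclose(out);                /* also flushes buffered output */
    return 0;
}
```

Note that c is an int, not a char, so that the EOF value can be told apart from every valid character.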

When a program is started, three files are opened automatically, and file pointers are
provided for them. These files are the standard input, the standard output, and the standard
error output; the corresponding file pointers are called stdin, stdout, and stderr.
Normally these are all connected to the terminal, but may be redirected to files or pipes as
described in Section 2.2. stdin, stdout and stderr are pre-defined in the I/O library as
the standard input, output and error files; they may be used anywhere an object of type
FILE * can be. They are constants, however, not variables, so don’t try to assign to them.

With some of the preliminaries out of the way, we can now write wc. The basic design
is one that has been found convenient for many programs: if there are command-line arguments,
they are processed in order. If there are no arguments, the standard input is processed.
This way the program can be used stand-alone or as part of a larger process.

    #include <stdio.h>

    main(argc, argv)    /* wc: count lines, words, chars */
    int argc;
    char *argv[];
    {
        int c, i, inword;
        FILE *fp, *fopen();
        long linect, wordct, charct;
        long tlinect = 0, twordct = 0, tcharct = 0;

        i = 1;
        fp = stdin;
        do {
            if (argc > 1 && (fp=fopen(argv[i], "r")) == NULL) {
                fprintf(stderr, "wc: can't open %s\n", argv[i]);
                continue;
            }
            linect = wordct = charct = inword = 0;
            while ((c = getc(fp)) != EOF) {
                charct++;
                if (c == '\n')
                    linect++;
                if (c == ' ' || c == '\t' || c == '\n')
                    inword = 0;
                else if (inword == 0) {
                    inword = 1;
                    wordct++;
                }
            }
            printf("%7ld %7ld %7ld", linect, wordct, charct);
            printf(argc > 1 ? " %s\n" : "\n", argv[i]);
            fclose(fp);
            tlinect += linect;
            twordct += wordct;
            tcharct += charct;
        } while (++i < argc);
        if (argc > 2)
            printf("%7ld %7ld %7ld total\n", tlinect, twordct, tcharct);
        exit(0);
    }

The function fprintf is identical to printf, save that the first argument is a file pointer 
that specifies the file to be written. 

The function fclose is the inverse of fopen; it breaks the connection between the file
pointer and the external name that was established by fopen, freeing the file pointer for
another file. Since there is a limit on the number of files that a program may have open
simultaneously, it’s a good idea to free things when they are no longer needed. There is
also another reason to call fclose on an output file: it flushes the buffer in which putc
is collecting output. (fclose is called automatically for each open file when a program
terminates normally.)

3.2. Error Handling — Stderr and Exit 

stderr is assigned to a program in the same way that stdin and stdout are. Output
written on stderr appears on the user’s terminal even if the standard output is redirected.
wc writes its diagnostics on stderr instead of stdout so that if one of the files can’t be
accessed for some reason, the message finds its way to the user’s terminal instead of
disappearing down a pipeline or into an output file.

The program actually signals errors in another way, using the function exit to terminate
program execution. The argument of exit is available to whatever process called
it (see Section 6), so the success or failure of the program can be tested by another program
that uses this one as a sub-process. By convention, a return value of 0 signals that
all is well; non-zero values signal abnormal situations.
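
To illustrate the calling side of this convention, here is a hedged sketch of how one program might test the exit value of another. It relies on the library routine system and on the WIFEXITED and WEXITSTATUS macros from <sys/wait.h>, which are POSIX facilities assumed here rather than taken from this paper; the name exit_status_of is hypothetical.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Run a shell command and return the value it passed to exit,
   or -1 if it could not be run or did not terminate normally. */
int exit_status_of(const char *command)
{
    int status = system(command);

    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A caller would typically treat a result of 0 as success and anything else as a reason to stop or to report a failure.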

exit itself calls fclose for each open output file, to flush out any buffered output,
then calls a routine named _exit. The function _exit causes immediate termination
without any buffer flushing; it may be called directly if desired.

3.3. Miscellaneous I/O Functions 

The standard I/O library provides several other I/O functions besides those we have 
illustrated above. 

Normally output with putc, etc., is buffered (except to stderr); to force it out 
immediately, use fflush(fp). 

fscanf is identical to scanf, except that its first argument is a file pointer (as with 
fprintf) that specifies the file from which the input comes; it returns EOF at end of file. 

The functions sscanf and sprintf are identical to fscanf and fprintf, except 
that the first argument names a character string instead of a file pointer. The conversion 
is done from the string for sscanf and into it for sprintf. 

fgets(buf, size, fp) copies the next line from fp, up to and including a newline,
into buf; at most size-1 characters are copied; it returns NULL at end of file.
fputs(buf, fp) writes the string in buf onto file fp.

The function ungetc(c, fp) “pushes back” the character c onto the input stream
fp; a subsequent call to getc, fscanf, etc., will encounter c. Only one character of
pushback per file is permitted.
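
A common use of ungetc is reading a token whose end can only be recognized by reading one character too many. The sketch below, with the hypothetical name read_number, collects decimal digits and pushes the first non-digit back so that the next reader sees it. It assumes ANSI prototypes and isdigit from <ctype.h>.

```c
#include <stdio.h>
#include <ctype.h>

/* Read an unsigned decimal number from fp.
   Stores it in *value and returns 1, or returns 0 if no digit is next. */
int read_number(FILE *fp, long *value)
{
    int c;
    long n = 0;
    int ndigits = 0;

    while ((c = getc(fp)) != EOF && isdigit(c)) {
        n = 10 * n + (c - '0');
        ndigits++;
    }
    if (c != EOF)
        ungetc(c, fp);          /* not part of the number; put it back */
    if (ndigits == 0)
        return 0;
    *value = n;
    return 1;
}
```

Because only one character of pushback is permitted, the routine calls ungetc at most once per invocation.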

4. LOW-LEVEL I/O 

This section describes the bottom level of I/O on the UNIX system. The lowest level of 
I/O in UNIX provides no buffering or any other services; it is in fact a direct entry into the 
operating system. You are entirely on your own, but on the other hand, you have the most 
control over what happens. And since the calls and usage are quite simple, this isn’t as 
bad as it sounds. 

4.1. File Descriptors 

In the UNIX operating system, all input and output is done by reading or writing files, 
because all peripheral devices, even the user’s terminal, are files in the file system. This 
means that a single, homogeneous interface handles all communication between a program 
and peripheral devices. 

In the most general case, before reading or writing a file, it is necessary to inform the 
system of your intent to do so, a process called “opening” the file. If you are going to 
write on a file, it may also be necessary to create it. The system checks your right to do so 
(Does the file exist? Do you have permission to access it?), and if all is well, returns a 
small positive integer called a file descriptor. Whenever I/O is to be done on the file, the 
file descriptor is used instead of the name to identify the file. (This is roughly analogous 
to the use of READ(5,...) and WRITE(6,...) in Fortran.) All information about an open file is
maintained by the system; the user program refers to the file only by the file descriptor. 

The file pointers discussed in section 3 are similar in spirit to file descriptors, but file 
descriptors are more fundamental. A file pointer is a pointer to a structure that contains, 
among other things, the file descriptor for the file in question. 

Since input and output involving the user’s terminal are so common, special arrangements
exist to make this convenient. When the command interpreter (the “shell”) runs a
program, it opens three files, with file descriptors 0, 1, and 2, called the standard input, the
standard output, and the standard error output. All of these are normally connected to the
terminal, so if a program reads file descriptor 0 and writes file descriptors 1 and 2, it can
do terminal I/O without worrying about opening the files.

If I/O is redirected to and from files with < and >, as in

    prog <infile >outfile

the shell changes the default assignments for file descriptors 0 and 1 from the terminal to
the named files. Similar observations hold if the input or output is associated with a pipe.
Normally file descriptor 2 remains attached to the terminal, so error messages can go there.
In all cases, the file assignments are changed by the shell, not by the program. The program
does not need to know where its input comes from nor where its output goes, so long
as it uses file 0 for input and 1 and 2 for output.

4.2. Read and Write 

All input and output is done by two functions called read and write. For both, the 
first argument is a file descriptor. The second argument is a buffer in your program where 
the data is to come from or go to. The third argument is the number of bytes to be 
transferred. The calls are 

    n_read = read(fd, buf, n);
    n_written = write(fd, buf, n);

Each call returns a byte count which is the number of bytes actually transferred. On reading,
the number of bytes returned may be less than the number asked for, because fewer
than n bytes remained to be read. (When the file is a terminal, read normally reads only
up to the next newline, which is generally less than what was requested.) A return value of
zero bytes implies end of file, and -1 indicates an error of some sort. For writing, the
returned value is the number of bytes actually written; it is generally an error if this isn’t
equal to the number supposed to be written.
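
The three possible outcomes of read (a positive count, zero at end of file, and -1 on error) can be handled in one loop. The following sketch, under the assumption of a POSIX <unistd.h> and with the hypothetical name read_full, keeps calling read after a short count until the request is satisfied or the input runs out.

```c
#include <unistd.h>

/* Read up to n bytes from fd into buf, calling read again after a short
   count. Returns the number of bytes read (less than n only at end of
   file), or -1 on error. */
int read_full(int fd, char *buf, int n)
{
    int total = 0;
    int got;

    while (total < n) {
        got = read(fd, buf + total, n - total);
        if (got < 0)
            return -1;          /* error */
        if (got == 0)
            break;              /* end of file */
        total += got;
    }
    return total;
}
```

A loop of this kind matters most for terminals and pipes, where a single read often returns fewer bytes than were requested.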

The number of bytes to be read or written is quite arbitrary. The two most common 
values are 1, which means one character at a time (“unbuffered”), and 512, which 
corresponds to a physical blocksize on many peripheral devices. This latter size will be 
most efficient, but even character at a time I/O is not inordinately expensive. 

Putting these facts together, we can write a simple program to copy its input to its output.
This program will copy anything to anything, since the input and output can be
redirected to any file or device.

    #define BUFSIZE 512    /* best size for PDP-11 UNIX */

    main()    /* copy input to output */
    {
        char buf[BUFSIZE];
        int n;

        while ((n = read(0, buf, BUFSIZE)) > 0)
            write(1, buf, n);
        exit(0);
    }

If the file size is not a multiple of BUFSIZE, some read will return a smaller number of 
bytes to be written by write; the next call to read after that will return zero. 

It is instructive to see how read and write can be used to construct higher level routines
like getchar, putchar, etc. For example, here is a version of getchar which does
unbuffered input.

    #define CMASK 0377    /* for making char's > 0 */

    getchar()    /* unbuffered single character input */
    {
        char c;

        return((read(0, &c, 1) > 0) ? c & CMASK : EOF);
    }

c must be declared char, because read accepts a character pointer. The character being
returned must be masked with 0377 to ensure that it is positive; otherwise sign extension
may make it negative. (The constant 0377 is appropriate for the PDP-11 but not necessarily
for other machines.)

The second version of getchar does input in big chunks, and hands out the characters
one at a time.

    #define CMASK 0377    /* for making char's > 0 */
    #define BUFSIZE 512

    getchar()    /* buffered version */
    {
        static char buf[BUFSIZE];
        static char *bufp = buf;
        static int n = 0;

        if (n == 0) {    /* buffer is empty */
            n = read(0, buf, BUFSIZE);
            bufp = buf;
        }
        return((--n >= 0) ? *bufp++ & CMASK : EOF);
    }


4.3. Open, Creat, Close, Unlink 

Other than the default standard input, output and error files, you must explicitly open 
files in order to read or write them. There are two system entry points for this, open and 
creat [sic]. 

open is rather like the fopen discussed in the previous section, except that instead of 
returning a file pointer, it returns a file descriptor, which is just an int. 

int fd; 

    fd = open(name, rwmode);

As with fopen, the name argument is a character string corresponding to the external file
name. The access mode argument is different, however: rwmode is 0 for read, 1 for write,
and 2 for read and write access. open returns -1 if any error occurs; otherwise it returns a
valid file descriptor.

It is an error to try to open a file that does not exist. The entry point creat is provided
to create new files, or to re-write old ones.

    fd = creat(name, pmode);

returns a file descriptor if it was able to create the file called name, and -1 if not. If the 
file already exists, creat will truncate it to zero length; it is not an error to creat a file 
that already exists. 

If the file is brand new, creat creates it with the protection mode specified by the
pmode argument. In the UNIX file system, there are nine bits of protection information
associated with a file, controlling read, write and execute permission for the owner of the
file, for the owner’s group, and for all others. Thus a three-digit octal number is most convenient
for specifying the permissions. For example, 0755 specifies read, write and execute
permission for the owner, and read and execute permission for the group and everyone else.
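
The correspondence between an octal mode and the nine permission bits can be made concrete with a small sketch. The function mode_string below is a hypothetical helper, not a system routine; it walks the bits from the owner's read bit (0400) down to the others' execute bit (0001) and writes the familiar letter form.

```c
/* Fill out (at least 10 bytes) with the "rwxrwxrwx" form of the
   nine protection bits in mode: owner, group, others. */
void mode_string(int mode, char *out)
{
    const char *letters = "rwxrwxrwx";
    int i;

    for (i = 0; i < 9; i++)
        out[i] = (mode & (0400 >> i)) ? letters[i] : '-';
    out[9] = '\0';
}
```

Each octal digit covers exactly one group of three bits, which is why 0755 reads as rwx for the owner and r-x for the group and for everyone else.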

To illustrate, here is a simplified version of the UNIX utility cp, a program which 
copies one file to another. (The main simplification is that our version copies only one file, 
and does not permit the second argument to be a directory.) 

    #define NULL 0
    #define BUFSIZE 512
    #define PMODE 0644    /* RW for owner, R for group, others */

    main(argc, argv)    /* cp: copy f1 to f2 */
    int argc;
    char *argv[];
    {
        int f1, f2, n;
        char buf[BUFSIZE];

        if (argc != 3)
            error("Usage: cp from to", NULL);
        if ((f1 = open(argv[1], 0)) == -1)
            error("cp: can't open %s", argv[1]);
        if ((f2 = creat(argv[2], PMODE)) == -1)
            error("cp: can't create %s", argv[2]);

        while ((n = read(f1, buf, BUFSIZE)) > 0)
            if (write(f2, buf, n) != n)
                error("cp: write error", NULL);
        exit(0);
    }

    error(s1, s2)    /* print error message and die */
    char *s1, *s2;
    {
        printf(s1, s2);
        printf("\n");
        exit(1);
    }

As we said earlier, there is a limit (typically 15-25) on the number of files which a program
may have open simultaneously. Accordingly, any program which intends to process
many files must be prepared to re-use file descriptors. The routine close breaks the connection
between a file descriptor and an open file, and frees the file descriptor for use with
some other file. Termination of a program via exit or return from the main program
closes all open files.

The function unlink(filename) removes the file filename from the file system.
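
As a brief sketch of how creat, write and close fit together, the hypothetical routine save_bytes below creates (or truncates) a file, writes a buffer to it, and closes the descriptor so that it can be re-used. It assumes the POSIX headers <fcntl.h> and <unistd.h> and ANSI prototypes; the mode 0644 is the same value used for PMODE in the cp example.

```c
#include <fcntl.h>
#include <unistd.h>

/* Create (or truncate) name, write n bytes from buf, and close it.
   Returns 0 on success, -1 on any error. */
int save_bytes(const char *name, const char *buf, int n)
{
    int fd;

    if ((fd = creat(name, 0644)) == -1)
        return -1;
    if (write(fd, buf, n) != n) {
        close(fd);
        return -1;
    }
    return close(fd);           /* frees the descriptor for re-use */
}
```

A caller that no longer needs the file can remove it with unlink(name); a later attempt to open that name then fails with -1.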

4.4. Random Access — Seek and Lseek 

File I/O is normally sequential: each read or write takes place at a position in the file 
right after the previous one. When necessary, however, a file can be read or written in any 
arbitrary order. The system call lseek provides a way to move around in a file without 
actually reading or writing: 

lseek(fd, offset, origin); 

forces the current position in the file whose descriptor is fd to move to position offset, 
which is taken relative to the location specified by origin. Subsequent reading or writing 
will begin at that position. offset is a long; fd and origin are int’s. origin can be
0, 1, or 2 to specify that offset is to be measured from the beginning, from the current
position, or from the end of the file respectively. For example, to append to a file, seek to
the end before writing:

    lseek(fd, 0L, 2);

To get back to the beginning (“rewind”),

    lseek(fd, 0L, 0);

Notice the 0L argument; it could also be written as (long) 0.

With lseek, it is possible to treat files more or less like large arrays, at the price of
slower access. For example, the following simple function reads any number of bytes from
any arbitrary place in a file.


    get(fd, pos, buf, n)    /* read n bytes from position pos */
    int fd, n;
    long pos;
    char *buf;
    {
        lseek(fd, pos, 0);    /* get to pos */
        return(read(fd, buf, n));
    }

In pre-version 7 UNIX, the basic entry point to the I/O system is called seek. seek is
identical to lseek, except that its offset argument is an int rather than a long.
Accordingly, since PDP-11 integers have only 16 bits, the offset specified for seek is limited
to 65,535; for this reason, origin values of 3, 4, 5 cause seek to multiply the given
offset by 512 (the number of bytes in one physical block) and then interpret origin as if
it were 0, 1, or 2 respectively. Thus to get to an arbitrary place in a large file requires two
seeks, first one which selects the block, then one which has origin equal to 1 and moves
to the desired byte within the block.
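
The arithmetic behind the two-seek technique is simple division by the block size. The sketch below uses a hypothetical helper, split_offset, to compute the block number for the first seek and the byte displacement for the second; it does not call seek itself, since that entry point is not available on later systems.

```c
#define BLOCK 512   /* bytes per physical block, as in the text */

/* Split a byte position into a block number and a byte within that block,
   the two values needed for the pair of seek calls described above. */
void split_offset(long pos, long *block, int *byte)
{
    *block = pos / BLOCK;
    *byte = (int)(pos % BLOCK);
}
```

For position 100000, for instance, the block number is 195 and the displacement is 160, since 195 × 512 = 99840 and 100000 − 99840 = 160. Under the rules above, those values would be passed to seek with origin 3 and origin 1 respectively.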

4.5. Error Processing 

The routines discussed in this section, and in fact all the routines which are direct
entries into the system, can incur errors. Usually they indicate an error by returning a
value of -1. Sometimes it is nice to know what sort of error occurred; for this purpose all
these routines, when appropriate, leave an error number in the external cell errno. The
meanings of the various error numbers are listed in the introduction to Section II of the
UNIX Programmer’s Manual, so your program can, for example, determine if an attempt to
open a file failed because it did not exist or because the user lacked permission to read it.
Perhaps more commonly, you may want to print out the reason for failure. The routine
perror will print a message associated with the value of errno; more generally,
sys_errlist is an array of character strings which can be indexed by errno and printed by
your program.
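
As a hedged illustration of inspecting errno, the sketch below attempts to open a file and reports which error number the system left behind. The name open_error is hypothetical; the symbolic constants ENOENT and EACCES come from <errno.h>, and O_RDONLY from <fcntl.h> is the symbolic form of read mode 0. These headers are POSIX assumptions, not part of this paper.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open name for reading.
   Returns 0 if it opened, otherwise the errno value left by open. */
int open_error(const char *name)
{
    int fd = open(name, O_RDONLY);

    if (fd == -1)
        return errno;           /* e.g. ENOENT or EACCES */
    close(fd);
    return 0;
}
```

A caller can compare the result with ENOENT to detect a missing file or with EACCES to detect a permission problem, and can call perror immediately after the failing call when a printed message is all that is wanted.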

5. PROCESSES 

It is often easier to use a program written by someone else than to invent one’s own.

This section describes how to execute a program from within another.

5.1. The “System” Function 

The easiest way to execute a program from another is to use the standard library routine
system. system takes one argument, a command string exactly as typed at the terminal
(except for the newline at the end) and executes it. For instance, to time-stamp the
output of a program,

    main()
    {
        system("date");
        /* rest of processing */
    }

If the command string has to be built from pieces, the in-memory formatting capabilities 
of sprintf may be useful. 
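
A minimal sketch of building a command from pieces follows. It uses snprintf, the length-checked relative of sprintf, which is an assumption beyond the text but guards against overflowing the buffer; the function name run_with_arg and the 256-byte buffer size are likewise illustrative choices.

```c
#include <stdio.h>
#include <stdlib.h>

/* Build "prog arg" in a local buffer and hand it to system.
   Returns whatever system returns, or -1 if the command is too long. */
int run_with_arg(const char *prog, const char *arg)
{
    char command[256];
    int len;

    len = snprintf(command, sizeof command, "%s %s", prog, arg);
    if (len < 0 || len >= (int)sizeof command)
        return -1;              /* would not fit in the buffer */
    return system(command);
}
```

Since the string is interpreted by the shell exactly as if typed, any metacharacters in arg are expanded; callers that pass untrusted text should keep that in mind.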

Remember that getc and putc normally buffer their input; terminal I/O will not be
properly synchronized unless this buffering is defeated. For output, use fflush; for input,
see setbuf in the appendix.

5.2. Low-Level Process Creation — Execl and Execv 

If you’re not using the standard library, or if you need finer control over what happens, 
you will have to construct calls to other programs using the more primitive routines that 
the standard library’s system routine is based on. 

The most basic operation is to execute another program without returning, by using the 
routine execl. To print the date as the last action of a running program, use 

    execl("/bin/date", "date", NULL);

The first argument to execl is the file name of the command; you have to know where it is
found in the file system. The second argument is conventionally the program name (that
is, the last component of the file name), but this is seldom used except as a place-holder.
If the command takes arguments, they are strung out after this; the end of the list is
marked by a NULL argument.

The execl call overlays the existing program with the new one, runs that, then exits. 
There is no return to the original program. 

More realistically, a program might fall into two or more phases that communicate only 
through temporary files. Here it is natural to make the second pass simply an execl call 
from the first. 

The one exception to the rule that the original program never gets control back occurs 
when there is an error, for example if the file can’t be found or is not executable. If you 
don’t know where date is located, say 

    execl("/bin/date", "date", NULL);
    execl("/usr/bin/date", "date", NULL);
    fprintf(stderr, "Someone stole 'date'\n");

A variant of execl called execv is useful when you don’t know in advance how many 
arguments there are going to be. The call is 

execv(filename, argp); 

where argp is an array of pointers to the arguments; the last pointer in the array must be 
NULL so execv can tell where the list ends. As with execl, filename is the file in which
the program is found, and argp[0] is the name of the program. (This arrangement is 
identical to the argv array for program arguments.) 

Neither of these routines provides the niceties of normal command execution. There is
no automatic search of multiple directories; you have to know precisely where the
command is located. Nor do you get the expansion of metacharacters like <, >, *, ?, and []
in the argument list. If you want these, use execl to invoke the shell sh, which then does
all the work. Construct a string commandline that contains the complete command as it
would have been typed at the terminal, then say

    execl("/bin/sh", "sh", "-c", commandline, NULL);

The shell is assumed to be at a fixed place, /bin/sh. Its argument -c says to treat the
next argument as a whole command line, so it does just what you want. The only problem
is in constructing the right information in commandline.

5.3. Control of Processes — Fork and Wait 

So far what we’ve talked about isn’t really all that useful by itself. Now we will show 
how to regain control after running a program with execl or execv. Since these routines 
simply overlay the new program on the old one, to save the old one requires that it first be
split into two copies; one of these can be overlaid, while the other waits for the new,
overlaying program to finish. The splitting is done by a routine called fork:

    proc_id = fork();

splits the program into two copies, both of which continue to run. The only difference
between the two is the value of proc_id, the “process id.” In one of these processes (the
“child”), proc_id is zero. In the other (the “parent”), proc_id is non-zero; it is the
process number of the child. Thus the basic way to call, and return from, another program is

    if (fork() == 0)
        execl("/bin/sh", "sh", "-c", cmd, NULL);    /* in child */

And in fact, except for handling errors, this is sufficient. The fork makes two copies of
the program. In the child, the value returned by fork is zero, so it calls execl which does
the command and then dies. In the parent, fork returns non-zero so it skips the execl.
(If there is any error, fork returns -1.)

More often, the parent wants to wait for the child to terminate before continuing itself. 
This can be done with the function wait: 

    int status;

    if (fork() == 0)
        execl(...);
    wait(&status);

This still doesn’t handle any abnormal conditions, such as a failure of the execl or fork,
or the possibility that there might be more than one child running simultaneously. (The 
wait returns the process id of the terminated child, if you want to check it against the 
value returned by fork.) Finally, this fragment doesn’t deal with any funny behavior on 
the part of the child (which is reported in status). Still, these three lines are the heart of 
the standard library’s system routine, which we’ll show in a moment. 

The status returned by wait encodes in its low-order eight bits the system’s idea of
the child’s termination status; it is 0 for normal termination and non-zero to indicate
various kinds of problems. The next higher eight bits are taken from the argument of the call
to exit which caused a normal termination of the child process. It is good coding practice
for all programs to return meaningful status.
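The fork, exec and wait sequence, with the error cases filled in, might be sketched as follows. The helper name run_command is hypothetical. The sketch assumes a POSIX system and uses waitpid with the WIFEXITED and WEXITSTATUS macros from sys/wait.h to pick the status word apart rather than shifting bits by hand; those are modern conveniences, not routines named in the text.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run cmd through /bin/sh and wait for it.
 * Returns the child's exit code (0..255), or -1 if the fork failed,
 * the wait failed, or the child terminated abnormally. */
int run_command(const char *cmd)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;                      /* fork failed: no child exists */
    if (pid == 0) {
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                     /* reached only if execl failed */
    }
    if (waitpid(pid, &status, 0) == -1) /* wait for this particular child */
        return -1;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);     /* the argument of the child's exit */
    return -1;                          /* killed by a signal, for example */
}
```

With this helper, run_command("exit 3") reports 3, the value the child passed to exit.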

When a program is called by the shell, the three file descriptors 0, 1, and 2 are set up
pointing at the right files, and all other possible file descriptors are available for use.
When this program calls another one, correct etiquette suggests making sure the same
conditions hold. Neither fork nor the exec calls affects open files in any way. If the parent
is buffering output that must come out before output from the child, the parent must flush
its buffers before the execl. Conversely, if a caller buffers an input stream, the called
program will lose any information that has been read by the caller.

5.4. Pipes 

A pipe is an I/O channel intended for use between two cooperating processes: one
process writes into the pipe, while the other reads. The system looks after buffering the data
and synchronizing the two processes. Most pipes are created by the shell, as in

    ls | pr

which connects the standard output of ls to the standard input of pr. Sometimes,
however, it is most convenient for a process to set up its own plumbing; in this section, we will
illustrate how the pipe connection is established and used.

The system call pipe creates a pipe. Since a pipe is used for both reading and writing, 
two file descriptors are returned; the actual usage is like this: 

    int fd[2];

    stat = pipe(fd);
    if (stat == -1)
        /* there was an error ... */

fd is an array of two file descriptors, where fd[0] is the read side of the pipe and fd[1]
is for writing. These may be used in read, write and close calls just like any other file
descriptors.

If a process reads a pipe which is empty, it will wait until data arrives; if a process 
writes into a pipe which is too full, it will wait until the pipe empties somewhat. If the 
write side of the pipe is closed, a subsequent read will encounter end of file. 
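The end-of-file rule can be seen without a second process. In the sketch below, which assumes a POSIX system and uses the hypothetical helper name pipe_round_trip, one process writes a short message into a pipe, closes the write side, and reads the message back; a further read then returns 0, which is how end of file is reported. The message must be short enough to fit in the pipe's internal buffer, since nobody is reading while the write is in progress.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: send msg through a pipe and read it back into buf
 * (at most size-1 bytes), all within one process.  Returns the number of
 * bytes read back, or -1 on error.  *saw_eof is set to 1 if a further
 * read, made after the write side was closed, reported end of file. */
int pipe_round_trip(const char *msg, char *buf, size_t size, int *saw_eof)
{
    int fd[2];
    ssize_t n;
    char extra;

    if (size == 0 || pipe(fd) == -1)
        return -1;
    if (write(fd[1], msg, strlen(msg)) == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    close(fd[1]);                       /* no writers remain */
    n = read(fd[0], buf, size - 1);
    if (n >= 0) {
        buf[n] = '\0';
        /* the pipe is drained and has no writer, so read returns 0 */
        *saw_eof = (read(fd[0], &extra, 1) == 0);
    }
    close(fd[0]);
    return (int)n;
}
```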

To illustrate the use of pipes in a realistic setting, let us write a function called 
popen(cmd, mode), which creates a process cmd (just as system does), and returns a file 
descriptor that will either read or write that process, according to mode. That is, the call 

    fout = popen("pr", WRITE);

creates a process that executes the pr command; subsequent write calls using the file 
descriptor fout will send their data to that process through the pipe. 

popen first creates the pipe with a pipe system call; it then forks to create two
copies of itself. The child decides whether it is supposed to read or write, closes the other 
side of the pipe, then calls the shell (via execl) to run the desired process. The parent 
likewise closes the end of the pipe it does not use. These closes are necessary to make 
end-of-file tests work properly. For example, if a child that intends to read fails to close 
the write end of the pipe, it will never see the end of the pipe file, just because there is one 
writer potentially active. 


    #include <stdio.h>

    #define READ    0
    #define WRITE   1
    #define tst(a, b)   (mode == READ ? (b) : (a))
    static int popen_pid;

    popen(cmd, mode)
    char *cmd;
    int mode;
    {
        int p[2];

        if (pipe(p) < 0)
            return(NULL);
        if ((popen_pid = fork()) == 0) {
            close(tst(p[WRITE], p[READ]));
            close(tst(0, 1));
            dup(tst(p[READ], p[WRITE]));
            close(tst(p[READ], p[WRITE]));
            execl("/bin/sh", "sh", "-c", cmd, 0);
            _exit(1);   /* disaster has occurred if we get here */
        }
        if (popen_pid == -1)
            return(NULL);
        close(tst(p[READ], p[WRITE]));
        return(tst(p[WRITE], p[READ]));
    }

The sequence of closes in the child is a bit tricky. Suppose that the task is to create a 
child process that will read data from the parent. Then the first close closes the write 
side of the pipe, leaving the read side open. The lines 

    close(tst(0, 1));
    dup(tst(p[READ], p[WRITE]));

are the conventional way to associate the pipe descriptor with the standard input of the
child. The close closes file descriptor 0, that is, the standard input. dup is a system call
that returns a duplicate of an already open file descriptor. File descriptors are assigned in 
increasing order and the first available one is returned, so the effect of the dup is to copy 
the file descriptor for the pipe (read side) to file descriptor 0; thus the read side of the pipe 
becomes the standard input. (Yes, this is a bit tricky, but it’s a standard idiom.) Finally, 
the old read side of the pipe is closed. 

A similar sequence of operations takes place when the child process is supposed to
write to the parent instead of reading from it. You may find it a useful exercise to step
through that case.
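For comparison, here is a sketch of the case where the child writes and the parent reads, with the mode test written out instead of hidden in a macro. The helper name capture_output is hypothetical, and dup2 is used in place of the close-then-dup idiom; dup2 is a POSIX call that names the target descriptor explicitly, and it is an assumption of this sketch rather than something the text uses.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run cmd via /bin/sh and collect up to size-1 bytes
 * of its standard output in buf.  Returns the byte count, or -1 on error. */
int capture_output(const char *cmd, char *buf, size_t size)
{
    int p[2], status;
    size_t used = 0;
    ssize_t n;
    pid_t pid;

    if (size == 0 || pipe(p) == -1)
        return -1;
    pid = fork();
    if (pid == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {                     /* child: writes to the parent */
        close(p[0]);                    /* the child never reads the pipe */
        if (p[1] != 1) {
            dup2(p[1], 1);              /* write side becomes standard output */
            close(p[1]);                /* drop the now-duplicate descriptor */
        }
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                     /* reached only if execl failed */
    }
    close(p[1]);        /* parent must close its write side, or the
                           read below would never see end of file */
    while (used < size - 1 &&
           (n = read(p[0], buf + used, size - 1 - used)) > 0)
        used += (size_t)n;
    buf[used] = '\0';
    close(p[0]);
    waitpid(pid, &status, 0);           /* lay the child to rest */
    return (int)used;
}
```

Calling capture_output("echo hello", out, sizeof out) leaves the six characters of "hello" plus a newline in out.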

The job is not quite done, for we still need a function pclose to close the pipe created
by popen. The main reason for using a separate function rather than close is that it is
desirable to wait for the termination of the child process. First, the return value from
pclose indicates whether the process succeeded. Equally important when a process
creates several children is that only a bounded number of unwaited-for children can exist,
even if some of them have terminated; performing the wait lays the child to rest. Thus:

    #include <signal.h>

    pclose(fd)  /* close pipe fd */
    int fd;
    {
        register r, (*hstat)(), (*istat)(), (*qstat)();
        int status;
        extern int popen_pid;

        close(fd);
        istat = signal(SIGINT, SIG_IGN);
        qstat = signal(SIGQUIT, SIG_IGN);
        hstat = signal(SIGHUP, SIG_IGN);
        while ((r = wait(&status)) != popen_pid && r != -1)
            ;
        if (r == -1)
            status = -1;
        signal(SIGINT, istat);
        signal(SIGQUIT, qstat);
        signal(SIGHUP, hstat);
        return(status);
    }

The calls to signal make sure that no interrupts, etc., interfere with the waiting process; 
this is the topic of the next section. 

The routine as written has the limitation that only one pipe may be open at once,
because of the single shared variable popen_pid; it really should be an array indexed by
file descriptor. A popen function, with slightly different arguments and return value, is
available as part of the standard I/O library discussed below. As currently written, it
shares the same limitation.
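The library version differs from the one developed here mainly in taking a mode string ("r" or "w") and returning a FILE pointer rather than a file descriptor. A minimal sketch of its use follows; it assumes a POSIX system where popen and pclose are declared in stdio.h, and the helper name first_line_of is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Hypothetical helper: read the first line printed by cmd into line.
 * Returns the status reported by pclose, or -1 if popen failed. */
int first_line_of(const char *cmd, char *line, int size)
{
    FILE *fp = popen(cmd, "r");     /* a FILE pointer, not a descriptor */

    if (fp == NULL)
        return -1;
    if (fgets(line, size, fp) == NULL)
        line[0] = '\0';             /* the command printed nothing */
    return pclose(fp);              /* waits for the command to finish */
}
```

Because the stream is an ordinary FILE pointer, fgets, getc, fprintf and the rest of the library can be used on it directly.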

6. SIGNALS — INTERRUPTS AND ALL THAT 

This section is concerned with how to deal gracefully with signals from the outside
world (like interrupts), and with program faults. Since there’s nothing very useful that can
be done from within C about program faults, which arise mainly from illegal memory
references or from execution of peculiar instructions, we’ll discuss only the outside-world
signals: interrupt, which is sent when the DEL character is typed; quit, generated by the FS
character; hangup, caused by hanging up the phone; and terminate, generated by the kill
command. When one of these events occurs, the signal is sent to all processes which were
started from the corresponding terminal; unless other arrangements have been made, the
signal terminates the process. In the quit case, a core image file is written for debugging
purposes.

The routine which alters the default action is called signal. It has two arguments:
the first specifies the signal, and the second specifies how to treat it. The first argument is
just a number code, but the second is either the address of a function, or a somewhat
strange code that requests that the signal either be ignored, or that it be given the default
action. The include file signal.h gives names for the various arguments, and should
always be included when signals are used. Thus

    #include <signal.h>

    signal(SIGINT, SIG_IGN);

causes interrupts to be ignored, while

    signal(SIGINT, SIG_DFL);

restores the default action of process termination. In all cases, signal returns the
previous value of the signal. The second argument to signal may instead be the name of a
function (which has to be declared explicitly if the compiler hasn’t seen it already). In this
case, the named routine will be called when the signal occurs. Most commonly this facility
is used to allow the program to clean up unfinished business before terminating, for
example to delete a temporary file:

    #include <signal.h>

    main()
    {
        int onintr();

        if (signal(SIGINT, SIG_IGN) != SIG_IGN)
            signal(SIGINT, onintr);

        /* Process ... */

        exit(0);
    }

    onintr()
    {
        unlink(tempfile);
        exit(1);
    }

Why the test and the double call to signal? Recall that signals like interrupt are sent 
to all processes started from a particular terminal. Accordingly, when a program is to be 
run non-interactively (started by &), the shell turns off interrupts for it so it won’t be 
stopped by interrupts intended for foreground processes. If this program began by 
announcing that all interrupts were to be sent to the onintr routine regardless, that would 
undo the shell’s effort to protect it when run in the background. 

The solution, shown above, is to test the state of interrupt handling, and to continue to 
ignore interrupts if they are already being ignored. The code as written depends on the 
fact that signal returns the previous state of a particular signal. If signals were already 
being ignored, the process should continue to ignore them; otherwise, they should be 
caught. 
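The test-then-catch idiom can be wrapped in a small routine. The sketch below uses the hypothetical names catch_unless_ignored and on_interrupt, and it declares handlers in the modern way, as functions taking the signal number and returning void, which differs from the int-valued declarations used in this paper.

```c
#include <signal.h>

/* Set by the handler so the effect of a delivered signal can be observed. */
volatile sig_atomic_t interrupted = 0;

/* Hypothetical handler: just record that the signal arrived. */
void on_interrupt(int sig)
{
    (void)sig;
    interrupted = 1;
}

/* Hypothetical helper: arrange for handler to catch sig unless sig was
 * already being ignored.  Returns 1 if the handler was installed, 0 if
 * the signal was left ignored, -1 if signal() itself failed. */
int catch_unless_ignored(int sig, void (*handler)(int))
{
    void (*previous)(int) = signal(sig, SIG_IGN);   /* test the old state */

    if (previous == SIG_ERR)
        return -1;
    if (previous == SIG_IGN)
        return 0;               /* respect the shell's decision */
    signal(sig, handler);       /* otherwise catch the signal */
    return 1;
}
```

A program would typically call catch_unless_ignored(SIGINT, on_interrupt) once near the start of main.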

A more sophisticated program may wish to intercept an interrupt and interpret it as a 
request to stop what it is doing and return to its own command-processing loop. Think of 
a text editor: interrupting a long printout should not cause it to terminate and lose the 
work already done. The outline of the code for this case is probably best written like this: 

    #include <signal.h>
    #include <setjmp.h>
    jmp_buf sjbuf;

    main()
    {
        int (*istat)(), onintr();

        istat = signal(SIGINT, SIG_IGN);    /* save original status */
        setjmp(sjbuf);      /* save current stack position */
        if (istat != SIG_IGN)
            signal(SIGINT, onintr);

        /* main processing loop */
    }

    onintr()
    {
        printf("\nInterrupt\n");
        longjmp(sjbuf);     /* return to saved state */
    }

The include file setjmp.h declares the type jmp_buf, an object in which the state can be
saved. sjbuf is such an object; it is an array of some sort. The setjmp routine then
saves the state of things. When an interrupt occurs, a call is forced to the onintr routine,
which can print a message, set flags, or whatever. longjmp takes as argument an object
stored into by setjmp, and restores control to the location after the call to setjmp, so
control (and the stack level) will pop back to the place in the main routine where the signal
is set up and the main loop entered. Notice, by the way, that the signal gets set again
after an interrupt occurs. This is necessary; most signals are automatically reset to their
default action when they occur.
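On current systems the same outline needs two adjustments, shown in the sketch below: longjmp takes a second argument (the value that setjmp appears to return), and a handler that jumps out should use the POSIX pair sigsetjmp and siglongjmp so that the signal is not left blocked afterwards. Both points are assumptions about a modern POSIX environment, not statements from this paper, and count_restarts is a hypothetical stand-in for a real command loop that interrupts itself with raise for demonstration.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>

static sigjmp_buf restart_point;

static void on_interrupt(int sig)
{
    (void)sig;
    siglongjmp(restart_point, 1);   /* back to the sigsetjmp call */
}

/* Hypothetical demo loop: raises SIGINT at itself count times and returns
 * how many times control came back through the saved state. */
int count_restarts(int count)
{
    volatile int restarts = 0;      /* volatile: changed between the
                                       sigsetjmp and the siglongjmp */
    void (*old)(int) = signal(SIGINT, SIG_IGN);     /* save original status */

    if (sigsetjmp(restart_point, 1) != 0)
        restarts++;                 /* we arrived here from the handler */
    signal(SIGINT, on_interrupt);   /* set (again) on every pass */
    if (restarts < count)
        raise(SIGINT);              /* stands in for the user's DEL key */
    signal(SIGINT, old);            /* put back the caller's disposition */
    return restarts;
}
```

Unlike the outline in the text, the demo installs its handler unconditionally so that its behavior does not depend on how it was started.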

Some programs that want to detect signals simply can’t be stopped at an arbitrary 
point, for example in the middle of updating a linked list. If the routine called on 
occurrence of a signal sets a flag and then returns instead of calling exit or longjmp,
execution will continue at the exact point it was interrupted. The interrupt flag can then be
tested later. 
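A sketch of the flag technique follows. The names note_interrupt, interrupt_seen and do_updates are hypothetical, and the loop interrupts itself with raise only so that the effect can be observed; in a real program the signal would come from the terminal. The flag is declared volatile sig_atomic_t, the type ISO C provides for objects shared with a signal handler.

```c
#include <signal.h>

static volatile sig_atomic_t interrupt_seen = 0;

/* Handler: re-arm the signal, note that it happened, and return. */
static void note_interrupt(int sig)
{
    signal(sig, note_interrupt);    /* some systems reset to the default */
    interrupt_seen = 1;
}

/* Hypothetical worker: performs up to steps updates, each of which must
 * not be abandoned halfway.  An interrupt is simulated while update number
 * interrupt_at is in progress (use -1 for none).  Returns the number of
 * updates completed before the flag was noticed. */
int do_updates(int steps, int interrupt_at)
{
    int done = 0;
    void (*old)(int) = signal(SIGINT, note_interrupt);

    interrupt_seen = 0;
    while (done < steps && !interrupt_seen) {   /* flag tested between updates */
        if (done == interrupt_at)
            raise(SIGINT);      /* arrives in the middle of this update */
        done++;                 /* the update still runs to completion */
    }
    signal(SIGINT, old);        /* restore the caller's disposition */
    return done;
}
```

With five steps and an interrupt during the third, the third update finishes before the loop stops, so three updates are reported.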

There is one difficulty associated with this approach. Suppose the program is reading 
the terminal when the interrupt is sent. The specified routine is duly called; it sets its flag 
and returns. If it were really true, as we said above, that “execution resumes at the exact 
point it was interrupted,” the program would continue reading the terminal until the user 
typed another line. This behavior might well be confusing, since the user might not know 
that the program is reading; he presumably would prefer to have the signal take effect 
instantly. The method chosen to resolve this difficulty is to terminate the terminal read 
when execution resumes after the signal, returning an error code which indicates what
happened.

Thus programs which catch and resume execution after signals should be prepared for
“errors” which are caused by interrupted system calls. (The ones to watch out for are
reads from a terminal, wait, and pause.) A program whose onintr routine just sets
intflag, resets the interrupt signal, and returns, should usually include code like the
following when it reads the standard input:

    if (getchar() == EOF)
        if (intflag)
            /* EOF caused by interrupt */
        else
            /* true end-of-file */

A final subtlety to keep in mind becomes important when signal-catching is combined 
with execution of other programs. Suppose a program catches interrupts, and also includes 
a method (like “!” in the editor) whereby other programs can be executed. Then the code 
should look something like this: 

    if (fork() == 0)
        execl(...);
    signal(SIGINT, SIG_IGN);    /* ignore interrupts */
    wait(&status);              /* until the child is done */
    signal(SIGINT, onintr);     /* restore interrupts */

Why is this? Again, it’s not obvious but not really difficult. Suppose the program you call
catches its own interrupts. If you interrupt the subprogram, it will get the signal and
return to its main loop, and probably read your terminal. But the calling program will also
pop out of its wait for the subprogram and read your terminal. Having two processes
reading your terminal is very unfortunate, since the system figuratively flips a coin to decide
who should get each line of input. A simple way out is to have the parent program ignore
interrupts until the child is done. This reasoning is reflected in the standard I/O library
function system:

    #include <signal.h>

    system(s)   /* run command string s */
    char *s;
    {
        int status, pid, w;
        register int (*istat)(), (*qstat)();

        if ((pid = fork()) == 0) {
            execl("/bin/sh", "sh", "-c", s, 0);
            _exit(127);
        }
        istat = signal(SIGINT, SIG_IGN);
        qstat = signal(SIGQUIT, SIG_IGN);
        while ((w = wait(&status)) != pid && w != -1)
            ;
        if (w == -1)
            status = -1;
        signal(SIGINT, istat);
        signal(SIGQUIT, qstat);
        return(status);
    }

As an aside on declarations, the function signal obviously has a rather strange second
argument. It is in fact a pointer to a function delivering an integer, and this is also the
type of the signal routine itself. The two values SIG_IGN and SIG_DFL have the right
type, but are chosen so they coincide with no possible actual functions. For the enthusiast,
here is how they are defined for the PDP-11; the definitions should be sufficiently ugly and
nonportable to encourage use of the include file.

    #define SIG_DFL (int (*)())0
    #define SIG_IGN (int (*)())1

Appendix — The Standard I/O Library

D. M. Ritchie 

The standard I/O library was designed with the following goals in mind. 

1. It must be as efficient as possible, both in time and in space, so that there will be no 
hesitation in using it no matter how critical the application. 

2. It must be simple to use, and also free of the magic numbers and mysterious calls
whose use mars the understandability and portability of many programs using older
packages.

3. The interface provided should be applicable on all machines, whether or not the
programs which implement it are directly portable to other systems, or to machines other
than the PDP-11 running a version of UNIX.

1. General Usage 

Each program using the library must have the line

    #include <stdio.h>

which defines certain macros and variables. The routines are in the normal C library, so
no special library argument is needed for loading. All names in the include file intended
only for internal use begin with an underscore _ to reduce the possibility of collision with a
user name. The names intended to be visible outside the package are

stdin     The name of the standard input file

stdout    The name of the standard output file

stderr    The name of the standard error file

EOF       is actually -1, and is the value returned by the read routines on end-of-file or
          error.

NULL      is a notation for the null pointer, returned by pointer-valued functions to
          indicate an error.

FILE      expands to struct _iob and is a useful shorthand when declaring pointers to
          streams.

BUFSIZ    is a number (viz. 512) of the size suitable for an I/O buffer supplied by the user.
          See setbuf, below.

getc, getchar, putc, putchar, feof, ferror, fileno
          are defined as macros. Their actions are described below; they are mentioned
          here to point out that it is not possible to redeclare them and that they are not
          actually functions; thus, for example, they may not have breakpoints set on
          them.

The routines in this package offer the convenience of automatic buffer allocation and 
output flushing where appropriate. The names stdin, stdout, and stderr are in effect 
constants and may not be assigned to. 

2. Calls 

FILE *fopen(filename, type) char *filename, *type;

opens the file and, if needed, allocates a buffer for it. filename is a character string
specifying the name. type is a character string (not a single character). It may be
"r", "w", or "a" to indicate intent to read, write, or append. The value returned
is a file pointer. If it is NULL the attempt to open failed.

FILE *freopen(filename, type, ioptr) char *filename, *type; FILE *ioptr;

The stream named by ioptr is closed, if necessary, and then reopened as if by fopen.
If the attempt to open fails, NULL is returned, otherwise ioptr, which will now refer to
the new file. Often the reopened stream is stdin or stdout.

int getc(ioptr) FILE *ioptr; 

returns the next character from the stream named by ioptr, which is a pointer to a 
file such as returned by fopen, or the name stdin. The integer EOF is returned on 
end-of-file or when an error occurs. The null character \0 is a legal character.

int fgetc(ioptr) FILE *ioptr; 

acts like getc but is a genuine function, not a macro, so it can be pointed to, passed as 
an argument, etc. 

putc(c, ioptr) FILE *ioptr; 

putc writes the character c on the output stream named by ioptr, which is a value 
returned from fopen or perhaps stdout or stderr. The character is returned as 
value, but EOF is returned on error. 

fputc(c, ioptr) FILE *ioptr; 

acts like putc but is a genuine function, not a macro. 

fclose(ioptr) FILE *ioptr; 

The file corresponding to ioptr is closed after any buffers are emptied. A buffer
allocated by the I/O system is freed. fclose is automatic on normal termination of the
program.

fflush(ioptr) FILE *ioptr;

Any buffered information on the (output) stream named by ioptr is written out.
Output files are normally buffered if and only if they are not directed to the terminal;
however, stderr always starts off unbuffered and remains so unless setbuf is used, or
unless it is reopened.

exit(errcode); 

terminates the process and returns its argument as status to the parent. This is a
special version of the routine which calls fflush for each output file. To terminate
without flushing, use _exit.

feof(ioptr) FILE *ioptr;

returns non-zero when end-of-file has occurred on the specified input stream.

ferror(ioptr) FILE *ioptr;

returns non-zero when an error has occurred while reading or writing the named
stream. The error indication lasts until the file has been closed.

getchar();

is identical to getc(stdin).

putchar(c);

is identical to putc(c, stdout).

char *fgets(s, n, ioptr) char *s; FILE *ioptr; 

reads up to n-1 characters from the stream ioptr into the character pointer s. The
read terminates with a newline character. The newline character is placed in the buffer
followed by a null character. fgets returns the first argument, or NULL if error or
end-of-file occurred.

fputs(s, ioptr) char *s; FILE *ioptr; 

writes the null-terminated string (character array) s on the stream ioptr. No newline 
is appended. No value is returned. 

ungetc(c, ioptr) FILE *ioptr; 

The argument character c is pushed back on the input stream named by ioptr. Only
one character may be pushed back.

printf(format, a1, ...) char *format;

fprintf(ioptr, format, a1, ...) FILE *ioptr; char *format;

sprintf(s, format, a1, ...) char *s, *format;

printf writes on the standard output. fprintf writes on the named output stream.
sprintf puts characters in the character array (string) named by s. The
specifications are as described in section printf(3) of the UNIX Programmer’s Manual.

scanf(format, a1, ...) char *format;

fscanf(ioptr, format, a1, ...) FILE *ioptr; char *format;

sscanf(s, format, a1, ...) char *s, *format;

scanf reads from the standard input. fscanf reads from the named input stream.
sscanf reads from the character string supplied as s. scanf reads characters,
interprets them according to a format, and stores the results in its arguments. Each routine
expects as arguments a control string format, and a set of arguments, each of which
must be a pointer, indicating where the converted input should be stored.

scanf returns as its value the number of successfully matched and assigned input 
items. This can be used to decide how many input items were found. On end of file, 
EOF is returned; note that this is different from 0, which means that the next input 
character does not match what was called for in the control string. 

fread(ptr, sizeof(*ptr), nitems, ioptr) FILE *ioptr;

reads nitems of data beginning at ptr from file ioptr. No advance notification that
binary I/O is being done is required; when, for portability reasons, it becomes required,
it will be done by adding an additional character to the mode-string on the fopen call.

fwrite(ptr, sizeof(*ptr), nitems, ioptr) FILE *ioptr;

Like fread, but in the other direction.

rewind(ioptr) FILE *ioptr;

rewinds the stream named by ioptr. It is not very useful except on input, since a 
rewound output file is still open only for output. 

system(string) char *string;

The string is executed by the shell as if typed at the terminal.

getw(ioptr) FILE *ioptr;

returns the next word from the input stream named by ioptr. EOF is returned on
end-of-file or error, but since this is a perfectly good integer, feof and ferror should be
used. A “word” is 16 bits on the PDP-11.

putw(w, ioptr) FILE *ioptr;

writes the integer w on the named output stream. 

setbuf(ioptr, buf) FILE *ioptr; char *buf;

setbuf may be used after a stream has been opened but before I/O has started. If
buf is NULL, the stream will be unbuffered. Otherwise the buffer supplied will be used.
It must be a character array of sufficient size:

    char buf[BUFSIZ];

fileno(ioptr) FILE *ioptr;

returns the integer file descriptor associated with the file.

fseek(ioptr, offset, ptrname) FILE *ioptr; long offset;

The location of the next byte in the stream named by ioptr is adjusted. offset is a
long integer. If ptrname is 0, the offset is measured from the beginning of the file; if
ptrname is 1, the offset is measured from the current read or write pointer; if
ptrname is 2, the offset is measured from the end of the file. The routine accounts
properly for any buffering. (When this routine is used on non-UNIX systems, the offset
must be a value returned from ftell and the ptrname must be 0.)

long ftell(ioptr) FILE *ioptr;

The byte offset, measured from the beginning of the file, associated with the named
stream is returned. Any buffering is properly accounted for. (On non-UNIX systems
the value of this call is useful only for handing to fseek, so as to position the file to
the same place it was when ftell was called.)

getpw(uid, buf) char *buf; 

The password file is searched for the given integer user ID. If an appropriate line is 
found, it is copied into the character array buf, and 0 is returned. If no line is found 
corresponding to the user ID then 1 is returned. 

char *malloc(num);

allocates num bytes. The pointer returned is sufficiently well aligned to be usable for
any purpose. NULL is returned if no space is available.

char *calloc(num, size);

allocates space for num items each of size size. The space is guaranteed to be set to 0
and the pointer is sufficiently well aligned to be usable for any purpose. NULL is
returned if no space is available.

cfree(ptr) char *ptr;

Space is returned to the pool used by calloc. Disorder can be expected if the pointer
was not obtained from calloc.
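Several of these calls can be seen working together in a short sketch that copies a file one character at a time and then measures a file with fseek and ftell. The helper names copy_file and file_size are hypothetical. The symbolic constant SEEK_END is the modern spelling of a ptrname of 2; it comes from current versions of stdio.h rather than from the library described here.

```c
#include <stdio.h>

/* Hypothetical helper: copy src to dst one character at a time.
 * Returns the number of characters copied, or -1 on any error. */
long copy_file(const char *src, const char *dst)
{
    FILE *in, *out;
    long count = 0;
    int c;                              /* int, not char: must hold EOF */

    if ((in = fopen(src, "r")) == NULL)
        return -1;
    if ((out = fopen(dst, "w")) == NULL) {
        fclose(in);
        return -1;
    }
    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF)
            break;                      /* write error; reported below */
        count++;
    }
    if (ferror(in) || ferror(out))      /* EOF alone does not say which */
        count = -1;
    fclose(in);
    if (fclose(out) == EOF)             /* flushing may fail here too */
        count = -1;
    return count;
}

/* Hypothetical helper: size of a file in bytes, or -1 on error. */
long file_size(const char *name)
{
    FILE *fp = fopen(name, "r");
    long size = -1;

    if (fp == NULL)
        return -1;
    if (fseek(fp, 0L, SEEK_END) == 0)   /* offset 0 from the end of the file */
        size = ftell(fp);               /* offset from the beginning */
    fclose(fp);
    return size;
}
```

Declaring c as int matters: getc returns EOF as a value distinct from every character, and that distinction is lost if the result is stored in a char.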

The following are macros whose definitions may be obtained by including <ctype.h>.

isalpha(c) returns non-zero if the argument is alphabetic.

isupper(c) returns non-zero if the argument is upper-case alphabetic.

islower(c) returns non-zero if the argument is lower-case alphabetic.

isdigit(c) returns non-zero if the argument is a digit.

isspace(c) returns non-zero if the argument is a spacing character: tab, newline, carriage
return, vertical tab, form feed, space.

ispunct(c) returns non-zero if the argument is any punctuation character, i.e., not a
space, letter, digit or control character.

isalnum(c) returns non-zero if the argument is a letter or a digit.

isprint(c) returns non-zero if the argument is printable: a letter, digit, or punctuation
character.

iscntrl(c) returns non-zero if the argument is a control character.

isascii(c) returns non-zero if the argument is an ascii character, i.e., less than octal
0200.

toupper(c) returns the upper-case character corresponding to the lower-case letter c.

tolower(c) returns the lower-case character corresponding to the upper-case letter c.
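A short sketch of these macros in use: the hypothetical helper upcase converts a string to upper case in place. It tests each character with islower before calling toupper, which matches the description above of toupper as defined for lower-case letters, and it converts each character to unsigned char first, a precaution current C libraries require for characters with negative values.

```c
#include <ctype.h>

/* Hypothetical helper: convert s to upper case in place.
 * Returns the number of letters that were changed. */
int upcase(char *s)
{
    int changed = 0;

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;    /* never pass a negative value */

        if (islower(c)) {
            *s = (char)toupper(c);
            changed++;
        }
    }
    return changed;
}
```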
PART 2: LANGUAGES


This part includes articles on the C language and the M4 language preprocessor available on
ULTRIX-32m. These articles are authoritative reference materials appropriate for people
familiar with programming in the languages described. Each article defines the
implementation of a language or preprocessor on the ULTRIX-32m system. With the exception
of the article on M4, these articles are not tutorial, and they are not for beginners.


C Language

"The C Programming Language — Reference Manual" lists in detail the rules, conventions,
and concepts that define the implementation of C on the VAX computer. This is reprinted
from an appendix in The C Programming Language, by Kernighan and Ritchie. Before you
use this article, you should know how to write programs in C and have read The C
Programming Language.


Kernighan, Brian W. and Ritchie, Dennis M., The C Programming Language, Prentice
Hall, Englewood Cliffs, N.J., 1978.


The next two articles describe C language compilers. "A Tour through the Portable C
Compiler," by Johnson, explains the Berkeley C compiler available in the ULTRIX-32m system.
It tells what happens when you compile a C program on ULTRIX-32m and is meant for
people who may support the C compiler. This article gives an excellent overview of the
organization, operation, and background of the ULTRIX-32m C compiler. The Ritchie article,
"A Tour through the UNIX C Compiler," describes the Bell UNIX C compiler, not
implemented on ULTRIX-32m.


M4 

M4 is a macro processor that provides string substitution. It accepts as input source code in 
any computer language and substitutes a defined text for each occurrence of a macro name.
"The M4 Macro Processor," by Kernighan and Ritchie, offers readable explanations and good
examples. You can use M4 to: 

• Set up your own macros 

• Create and use macros that take several arguments 

• Use a set of built-in macros 

• Bring in new files with an include function 

• Call shell functions with a system command

The C Programming Language — Reference Manual

Dennis M. Ritchie

Bell Laboratories, Murray Hill, New Jersey


This manual is reprinted, with minor changes, from The C Programming Language, by Brian W.
Kernighan and Dennis M. Ritchie, Prentice-Hall, Inc., 1978.


1. Introduction 

This manual describes the C language on the DEC PDP-11, the DEC VAX-11, the Honeywell 6000,
the IBM System/370, and the Interdata 8/32. Where differences exist, it concentrates on the PDP-11, but
tries to point out implementation-dependent details. With few exceptions, these dependencies follow
directly from the underlying properties of the hardware; the various compilers are generally quite
compatible.


2. Lexical conventions 

There are six classes of tokens: identifiers, keywords, constants, strings, operators, and other separators.
Blanks, tabs, newlines, and comments (collectively, "white space") as described below are ignored
except as they serve to separate tokens. Some white space is required to separate otherwise adjacent
identifiers, keywords, and constants.

If the input stream has been parsed into tokens up to a given character, the next token is taken to
include the longest string of characters which could possibly constitute a token.


2.1 Comments 

The characters /* introduce a comment, which terminates with the characters */. Comments do not 
nest. 


2.2 Identifiers (Names) 

An identifier is a sequence of letters and digits; the first character must be a letter. The underscore _
counts as a letter. Upper and lower case letters are different. No more than the first eight characters are
significant, although more may be used. External identifiers, which are used by various assemblers and
loaders, are more restricted:

        DEC PDP-11        7 characters, 2 cases
        DEC VAX-11        8 characters, 2 cases
        Honeywell 6000    6 characters, 1 case
        IBM 360/370       7 characters, 1 case
        Interdata 8/32    8 characters, 2 cases


2.3 Keywords 

The following identifiers are reserved for use as keywords, and may not be used otherwise: 


        int         extern      else
        char        register    for
        float       typedef     do
        double      static      while
        struct      goto        switch
        union       return      case
        long        sizeof      default
        short       break       entry
        unsigned    continue
        auto        if



The entry keyword is not currently implemented by any compiler but is reserved for future use. Some
implementations also reserve the words fortran and asm.

† UNIX is a Trademark of Bell Laboratories.

2.4 Constants 

There are several kinds of constants, as listed below. Hardware characteristics which affect sizes are
summarized in §2.6.

2.4.1 Integer constants 

An integer constant consisting of a sequence of digits is taken to be octal if it begins with 0 (digit
zero), decimal otherwise. The digits 8 and 9 have octal value 10 and 11 respectively. A sequence of
digits preceded by 0x or 0X (digit zero) is taken to be a hexadecimal integer. The hexadecimal digits
include a or A through f or F with values 10 through 15. A decimal constant whose value exceeds the
largest signed machine integer is taken to be long; an octal or hex constant which exceeds the largest
unsigned machine integer is likewise taken to be long.
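
As an illustration of the three radixes, the following sketch writes one value in decimal, octal, and hexadecimal. It assumes a present-day C compiler (which rejects the digits 8 and 9 in an octal constant, so that case is not shown); the function name check_integer_constants is hypothetical.

```c
#include <assert.h>

/* Each assertion spells the value thirty-one in a different radix. */
void check_integer_constants(void)
{
    assert(037 == 31);      /* leading 0 (digit zero): octal */
    assert(0x1F == 31);     /* leading 0x: hexadecimal */
    assert(0X1f == 0x1F);   /* the x and the digits a-f may be either case */
}
```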

2.4.2 Explicit long constants 

A decimal, octal, or hexadecimal integer constant immediately followed by l (letter ell) or L is a long 
constant. As discussed below, on some machines integer and long values may be considered identical. 

2.4.3 Character constants 

A character constant is a character enclosed in single quotes, as in 'x'. The value of a character 
constant is the numerical value of the character in the machine's character set. 

Certain non-graphic characters, the single quote ' and the backslash \, may be represented according
to the following table of escape sequences:

        newline            NL (LF)    \n
        horizontal tab     HT         \t
        backspace          BS         \b
        carriage return    CR         \r
        form feed          FF         \f
        backslash          \          \\
        single quote       '          \'
        bit pattern        ddd        \ddd


The escape \ddd consists of the backslash followed by 1, 2, or 3 octal digits which are taken to specify the
value of the desired character. A special case of this construction is \0 (not followed by a digit), which
indicates the character NUL. If the character following a backslash is not one of those specified, the
backslash is ignored.
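
The escapes in the table can be checked against their numeric values. The sketch below assumes the ASCII character set and a present-day compiler; check_escapes is a hypothetical name, and the case of an unrecognized escape is omitted because current compilers diagnose it rather than ignore the backslash.

```c
#include <assert.h>

/* Compare escape sequences with their ASCII codes (ASCII is assumed). */
void check_escapes(void)
{
    assert('\0' == 0);        /* NUL */
    assert('\101' == 'A');    /* octal 101 is the code for A */
    assert('\12' == '\n');    /* octal 12 is the code for newline */
    assert('\\' == 0134);     /* a single backslash, octal 134 */
}
```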

2.4.4 Floating constants 

A floating constant consists of an integer part, a decimal point, a fraction part, an e or E, and an
optionally signed integer exponent. The integer and fraction parts both consist of a sequence of digits.
Either the integer part or the fraction part (not both) may be missing; either the decimal point or the e
and the exponent (not both) may be missing. Every floating constant is taken to be double-precision.

2.5 Strings 

A string is a sequence of characters surrounded by double quotes, as in "...". A string has type
"array of characters" and storage class static (see §4 below) and is initialized with the given characters.
All strings, even when written identically, are distinct. The compiler places a null byte \0 at the end of
each string so that programs which scan the string can find its end. In a string, the double quote character
" must be preceded by a \; in addition, the same escapes as described for character constants may be
used. Finally, a \ and an immediately following newline are ignored.
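
A short sketch of the terminating null byte and of the \" escape follows. It assumes a present-day compiler and the standard library function strlen; check_strings is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/* The array holds the three characters plus the null byte the compiler adds. */
void check_strings(void)
{
    static const char s[] = "abc";

    assert(sizeof s == 4);                /* 'a', 'b', 'c', '\0' */
    assert(strlen(s) == 3);               /* strlen stops at the null byte */
    assert(s[3] == '\0');
    assert(strlen("say \"hi\"") == 8);    /* each \" contributes one character */
}
```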

2.6 Hardware characteristics 

The following table summarizes certain hardware properties which vary from machine to machine.
Although these affect program portability, in practice they are less of a problem than might be thought a
priori.

                  DEC PDP-11    Honeywell 6000    IBM 370     Interdata 8/32

                  ASCII         ASCII             EBCDIC      ASCII
        char      8 bits        9 bits            8 bits      8 bits
        int       16            36                32          32
        short     16            36                16          16
        long      32            36                32          32
        float     32            36                32          32
        double    64            72                64          64
        range     ±10^±38       ±10^±38           ±10^±76     ±10^±76


The VAX-11 is identical to the PDP-11 except that integers have 32 bits.

3. Syntax notation 

In the syntax notation used in this manual, syntactic categories are indicated by italic type, and literal
words and characters in bold type. Alternative categories are listed on separate lines. An optional terminal
or non-terminal symbol is indicated by the subscript "opt," so that

        { expression_opt }

indicates an optional expression enclosed in braces. The syntax is summarized in §18. 

4. What’s in a name? 

C bases the interpretation of an identifier upon two attributes of the identifier: its storage class and its
type. The storage class determines the location and lifetime of the storage associated with an identifier;
the type determines the meaning of the values found in the identifier's storage.

There are four declarable storage classes: automatic, static, external, and register. Automatic variables
are local to each invocation of a block (§9.2), and are discarded upon exit from the block; static
variables are local to a block, but retain their values upon reentry to a block even after control has left
the block; external variables exist and retain their values throughout the execution of the entire program,
and may be used for communication between functions, even separately compiled functions. Register
variables are (if possible) stored in the fast registers of the machine; like automatic variables they are
local to each block and disappear on exit from the block.
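
The difference between automatic and static storage can be sketched as follows. The example assumes a present-day compiler; next_ticket and check_storage_classes are hypothetical names.

```c
#include <assert.h>

/* Returns 1, 2, 3, ... on successive calls. */
static int next_ticket(void)
{
    static int count = 0;   /* static: keeps its value after control leaves the block */
    int step = 1;           /* automatic: created afresh on every invocation */

    count += step;
    return count;
}

void check_storage_classes(void)
{
    int first = next_ticket();

    assert(next_ticket() == first + 1);   /* count survived between the calls */
    assert(next_ticket() == first + 2);
}
```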

C supports several fundamental types of objects: 

Objects declared as characters (char) are large enough to store any member of the implementation's
character set, and if a genuine character from that character set is stored in a character variable, its value
is equivalent to the integer code for that character. Other quantities may be stored into character variables,
but the implementation is machine-dependent.

Up to three sizes of integer, declared short int, int, and long int, are available. Longer
integers provide no less storage than shorter ones, but the implementation may make either short
integers, or long integers, or both, equivalent to plain integers. "Plain" integers have the natural size
suggested by the host machine architecture; the other sizes are provided to meet special needs.

Unsigned integers, declared unsigned, obey the laws of arithmetic modulo 2^n where n is the
number of bits in the representation. (On the PDP-11, unsigned long quantities are not supported.)

Single-precision floating point (float) and double-precision floating point (double) may be
synonymous in some implementations.

Because objects of the foregoing types can usefully be interpreted as numbers, they will be referred
to as arithmetic types. Types char and int of all sizes will collectively be called integral types. float
and double will collectively be called floating types.

Besides the fundamental arithmetic types there is a conceptually infinite class of derived types constructed
from the fundamental types in the following ways:

        arrays of objects of most types;

        functions which return objects of a given type;

        pointers to objects of a given type;

        structures containing a sequence of objects of various types;

        unions capable of containing any one of several objects of various types.

In general these methods of constructing objects can be applied recursively.

5. Objects and lvalues

An object is a manipulatable region of storage; an lvalue is an expression referring to an object. An
obvious example of an lvalue expression is an identifier. There are operators which yield lvalues: for
example, if E is an expression of pointer type, then *E is an lvalue expression referring to the object to
which E points. The name "lvalue" comes from the assignment expression E1 = E2 in which the left
operand E1 must be an lvalue expression. The discussion of each operator below indicates whether it
expects lvalue operands and whether it yields an lvalue.

6. Conversions 

A number of operators may, depending on their operands, cause conversion of the value of an
operand from one type to another. This section explains the result to be expected from such conversions.
§6.6 summarizes the conversions demanded by most ordinary operators; it will be supplemented as
required by the discussion of each operator.

6.1 Characters and integers 

A character or a short integer may be used wherever an integer may be used. In all cases the value 
is converted to an integer. Conversion of a shorter integer to a longer always involves sign extension; 
integers are signed quantities. Whether or not sign-extension occurs for characters is machine dependent, 
but it is guaranteed that a member of the standard character set is non-negative. Of the machines treated 
by this manual, only the PDP-11 sign-extends. On the PDP-11, character variables range in value from 
-128 to 127; the characters of the ASCII alphabet are all positive. A character constant specified with an 
octal escape suffers sign extension and may appear negative; for example, '\377' has the value -1.

When a longer integer is converted to a shorter or to a char, it is truncated on the left; excess bits 
are simply discarded. 

6.2 Float and double 

All floating arithmetic in C is carried out in double-precision; whenever a float appears in an 
expression it is lengthened to double by zero-padding its fraction. When a double must be converted 
to float, for example by an assignment, the double is rounded before truncation to float length. 

6.3 Floating and integral 

Conversions of floating values to integral type tend to be rather machine-dependent; in particular the 
direction of truncation of negative numbers varies from machine to machine. The result is undefined if 
the value will not fit in the space provided. 

Conversions of integral values to floating type are well behaved. Some loss of precision occurs if the 
destination lacks sufficient bits. 

6.4 Pointers and integers 

An integer or long integer may be added to or subtracted from a pointer; in such a case the first is 
converted as specified in the discussion of the addition operator. 

Two pointers to objects of the same type may be subtracted; in this case the result is converted to an 
integer as specified in the discussion of the subtraction operator. 

6.5 Unsigned 

Whenever an unsigned integer and a plain integer are combined, the plain integer is converted to
unsigned and the result is unsigned. The value is the least unsigned integer congruent to the signed
integer (modulo 2^wordsize). In a 2's complement representation, this conversion is conceptual and there is
no actual change in the bit pattern.

When an unsigned integer is converted to long, the value of the result is the same numerically as 
that of the unsigned integer. Thus the conversion amounts to padding with zeros on the left. 
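
The two unsigned conversions can be sketched as follows, assuming a present-day compiler and the UINT_MAX macro from <limits.h>. The function name check_unsigned_conversions and the sample values are hypothetical.

```c
#include <assert.h>
#include <limits.h>

void check_unsigned_conversions(void)
{
    int i = -1;
    unsigned u = 1;
    unsigned v = 40000;
    long l = v;                        /* unsigned to long: same numeric value */

    /* -1 becomes the least unsigned value congruent to it modulo 2^wordsize. */
    assert((unsigned)i == UINT_MAX);

    /* i is converted to unsigned before the addition, so the sum wraps to 0. */
    assert(i + u == 0);

    assert(l == 40000L);
}
```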

6.6 Arithmetic conversions 

A great many operators cause conversions and yield result types in a similar way. This pattern will
be called the "usual arithmetic conversions."

First, any operands of type char or short are converted to int, and any of type float are converted
to double.

Then, if either operand is double, the other is converted to double and that is the type of the
result.

Otherwise, if either operand is long, the other is converted to long and that is the type of the
result.

Otherwise, if either operand is unsigned, the other is converted to unsigned and that is the type
of the result.

Otherwise, both operands must be int, and that is the type of the result.
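
The result type chosen by these rules can be observed with sizeof, as in the sketch below. It assumes a present-day compiler and deliberately avoids float operands, which later C standards no longer widen to double; check_arithmetic_conversions is a hypothetical name.

```c
#include <assert.h>

void check_arithmetic_conversions(void)
{
    char c = 'a';
    short s = 1;

    assert(sizeof(c + s) == sizeof(int));        /* char and short become int */
    assert(sizeof(s + 1L) == sizeof(long));      /* long operand: result is long */
    assert(sizeof(1 + 1.0) == sizeof(double));   /* double operand: result is double */

    assert(1 / 2 == 0);       /* both int: integer division */
    assert(1 / 2.0 == 0.5);   /* 1 is converted to double first */
}
```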

7. Expressions 

The precedence of expression operators is the same as the order of the major subsections of this section,
highest precedence first. Thus, for example, the expressions referred to as the operands of + (§7.4)
are those expressions defined in §§7.1-7.3. Within each subsection, the operators have the same precedence.
Left- or right-associativity is specified in each subsection for the operators discussed therein.
The precedence and associativity of all the expression operators is summarized in the grammar of §18.

Otherwise the order of evaluation of expressions is undefined. In particular the compiler considers
itself free to compute subexpressions in the order it believes most efficient, even if the subexpressions
involve side effects. The order in which side effects take place is unspecified. Expressions involving a
commutative and associative operator (*, +, &, |, ^) may be rearranged arbitrarily, even in the presence
of parentheses; to force a particular order of evaluation an explicit temporary must be used.

The handling of overflow and divide check in expression evaluation is machine-dependent. All existing
implementations of C ignore integer overflows; treatment of division by 0, and all floating-point
exceptions, varies between machines, and is usually adjustable by a library function.

7.1 Primary expressions 

Primary expressions involving ., ->, subscripting, and function calls group left to right. 

primary-expression:
        identifier
        constant
        string
        ( expression )
        primary-expression [ expression ]
        primary-expression ( expression-list_opt )
        primary-lvalue . identifier
        primary-expression -> identifier

expression-list:
        expression
        expression-list , expression

An identifier is a primary expression, provided it has been suitably declared as discussed below. Its type
is specified by its declaration. If the type of the identifier is "array of ...", however, then the value of
the identifier-expression is a pointer to the first object in the array, and the type of the expression is
"pointer to ...". Moreover, an array identifier is not an lvalue expression. Likewise, an identifier which
is declared "function returning ...", when used except in the function-name position of a call, is converted
to "pointer to function returning ...".

A constant is a primary expression. Its type may be int, long, or double depending on its form. 
Character constants have type int; floating constants are double. 

A string is a primary expression. Its type is originally "array of char"; but following the same rule
given above for identifiers, this is modified to "pointer to char" and the result is a pointer to the first
character in the string. (There is an exception in certain initializers; see §8.6.)

A parenthesized expression is a primary expression whose type and value are identical to those of the 
unadorned expression. The presence of parentheses does not affect whether the expression is an lvalue. 

A primary expression followed by an expression in square brackets is a primary expression. The
intuitive meaning is that of a subscript. Usually, the primary expression has type "pointer to ...", the
subscript expression is int, and the type of the result is "...". The expression E1[E2] is identical (by
definition) to *((E1)+(E2)). All the clues needed to understand this notation are contained in this section
together with the discussions in §§7.1, 7.2, and 7.4 on identifiers, *, and + respectively; §14.3 below
summarizes the implications.
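
The identity between subscripting and pointer arithmetic can be sketched as follows, assuming a present-day compiler. The array a and the name check_subscript are hypothetical.

```c
#include <assert.h>

void check_subscript(void)
{
    static int a[4] = { 10, 20, 30, 40 };
    int *p = a;                     /* the array name yields a pointer to a[0] */

    assert(&a[0] == p);
    assert(a[2] == *((a) + (2)));   /* E1[E2] is *((E1)+(E2)) by definition */
    assert(p[2] == a[2]);           /* subscripting works on any pointer */
    assert(2[a] == a[2]);           /* + is commutative, so the operands may be swapped */
}
```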
A function call is a primary expression followed by parentheses containing a possibly empty,
comma-separated list of expressions which constitute the actual arguments to the function. The primary
expression must be of type "function returning ...", and the result of the function call is of type "...".
As indicated below, a hitherto unseen identifier followed immediately by a left parenthesis is contextually
declared to represent a function returning an integer; thus in the most common case, integer-valued
functions need not be declared.

Any actual arguments of type float are converted to double before the call; any of type char or
short are converted to int; and as usual, array names are converted to pointers. No other conversions
are performed automatically; in particular, the compiler does not compare the types of actual arguments
with those of formal arguments. If conversion is needed, use a cast; see §7.2, 8.7.

In preparing for the call to a function, a copy is made of each actual parameter; thus, all argument-passing
in C is strictly by value. A function may change the values of its formal parameters, but these
changes cannot affect the values of the actual parameters. On the other hand, it is possible to pass a
pointer on the understanding that the function may change the value of the object to which the pointer
points. An array name is a pointer expression. The order of evaluation of arguments is undefined by the
language; take note that the various compilers differ.
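
A minimal sketch of call by value, and of passing a pointer so that the callee can change the caller's object, is given here. It assumes a present-day compiler; all three function names are hypothetical.

```c
#include <assert.h>

/* Assigns to its formal parameter, which is only a copy of the argument. */
static void set_by_value(int n)
{
    n = 99;
    (void)n;    /* silences "set but not used" warnings; has no effect */
}

/* Assigns to the object the pointer refers to. */
static void set_through_pointer(int *p)
{
    *p = 99;
}

void check_call_by_value(void)
{
    int x = 1;

    set_by_value(x);
    assert(x == 1);     /* the actual parameter is unchanged */

    set_through_pointer(&x);
    assert(x == 99);    /* the pointed-to object was changed */
}
```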

Recursive calls to any function are permitted. 

A primary expression followed by a dot followed by an identifier is an expression. The first expression
must be an lvalue naming a structure or a union, and the identifier must name a member of the
structure or union. The result is an lvalue referring to the named member of the structure or union.

A primary expression followed by an arrow (built from a - and a >) followed by an identifier is an
expression. The first expression must be a pointer to a structure or a union and the identifier must name
a member of that structure or union. The result is an lvalue referring to the named member of the structure
or union to which the pointer expression points.

Thus the expression E1->MOS is the same as (*E1).MOS. Structures and unions are discussed in
§8.5. The rules given here for the use of structures and unions are not enforced strictly, in order to allow
an escape from the typing mechanism. See §14.1.

7.2 Unary operators 

Expressions with unary operators group right-to-left.

unary-expression:
        * expression
        & lvalue
        - expression
        ! expression
        ~ expression
        ++ lvalue
        -- lvalue
        lvalue ++
        lvalue --
        ( type-name ) expression
        sizeof expression
        sizeof ( type-name )

The unary * operator means indirection: the expression must be a pointer, and the result is an lvalue
referring to the object to which the expression points. If the type of the expression is "pointer to ...",
the type of the result is "...".

The result of the unary & operator is a pointer to the object referred to by the lvalue. If the type of
the lvalue is "...", the type of the result is "pointer to ...".

The result of the unary - operator is the negative of its operand. The usual arithmetic conversions
are performed. The negative of an unsigned quantity is computed by subtracting its value from 2^n,
where n is the number of bits in an int. There is no unary + operator.

The result of the logical negation operator ! is 1 if the value of its operand is 0, 0 if the value of its
operand is non-zero. The type of the result is int. It is applicable to any arithmetic type or to pointers.

The ~ operator yields the one's complement of its operand. The usual arithmetic conversions are
performed. The type of the operand must be integral.

The object referred to by the lvalue operand of prefix ++ is incremented. The value is the new value
of the operand, but is not an lvalue. The expression ++x is equivalent to x+=1. See the discussions of
addition (§7.4) and assignment operators (§7.14) for information on conversions.

The lvalue operand of prefix -- is decremented analogously to the prefix ++ operator.

When postfix ++ is applied to an lvalue the result is the value of the object referred to by the lvalue. 
After the result is noted, the object is incremented in the same manner as for the prefix ++ operator. 
The type of the result is the same as the type of the lvalue expression. 

When postfix -- is applied to an lvalue the result is the value of the object referred to by the lvalue.
After the result is noted, the object is decremented in the same manner as for the prefix -- operator. The type
of the result is the same as the type of the lvalue expression.
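
The values yielded by the prefix and postfix forms can be compared in a short sketch, assuming a present-day compiler. The name check_increment is hypothetical.

```c
#include <assert.h>

void check_increment(void)
{
    int x = 5;
    int y;

    y = x++;                      /* postfix: old value is the result */
    assert(y == 5 && x == 6);

    y = ++x;                      /* prefix: new value is the result */
    assert(y == 7 && x == 7);

    y = x--;
    assert(y == 7 && x == 6);

    y = --x;
    assert(y == 5 && x == 5);
}
```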

An expression preceded by the parenthesized name of a data type causes conversion of the value of 
the expression to the named type. This construction is called a cast. Type names are described in §8.7. 

The sizeof operator yields the size, in bytes, of its operand. (A byte is undefined by the language
except in terms of the value of sizeof. However, in all existing implementations a byte is the space
required to hold a char.) When applied to an array, the result is the total number of bytes in the array.
The size is determined from the declarations of the objects in the expression. This expression is semantically
an integer constant and may be used anywhere a constant is required. Its major use is in communication
with routines like storage allocators and I/O systems.

The sizeof operator may also be applied to a parenthesized type name. In that case it yields the 
size, in bytes, of an object of the indicated type. 

The construction sizeof (type) is taken to be a unit, so the expression sizeof (type)-2 is the
same as (sizeof (type))-2.
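
The following sketch applies sizeof to an array, to an element, and to a type name, and shows how sizeof (type)-2 is grouped. It assumes a present-day compiler; the array table and the name check_sizeof are hypothetical.

```c
#include <assert.h>

void check_sizeof(void)
{
    static int table[10];

    assert(sizeof(char) == 1);                        /* a byte holds one char */
    assert(sizeof table == 10 * sizeof(int));         /* total bytes in the array */
    assert(sizeof table / sizeof table[0] == 10);     /* number of elements */

    /* sizeof (int) - 2 subtracts 2 from the size; it is not the size of
       the cast expression (int) -2, which is simply sizeof(int). */
    assert(sizeof(int) - 2 != sizeof((int) -2));
}
```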

7.3 Multiplicative operators 

The multiplicative operators *, /, and % group left-to-right. The usual arithmetic conversions are 
performed. 

multiplicative-expression:
        expression * expression
        expression / expression
        expression % expression

The binary * operator indicates multiplication. The * operator is associative and expressions with
several multiplications at the same level may be rearranged by the compiler.

The binary / operator indicates division. When positive integers are divided truncation is toward 0, 
but the form of truncation is machine-dependent if either operand is negative. On all machines covered 
by this manual, the remainder has the same sign as the dividend. It is always true that (a/b)*b + a%b
is equal to a (if b is not 0).
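
The identity relating / and % can be checked for each combination of signs. The sketch assumes a present-day compiler, where truncation toward 0 is defined for negative operands as well (C99 and later); identity_holds and check_division are hypothetical names.

```c
#include <assert.h>

/* Returns 1 if (a/b)*b + a%b reproduces a; b must not be 0. */
static int identity_holds(int a, int b)
{
    assert(b != 0);
    return (a / b) * b + a % b == a;
}

void check_division(void)
{
    assert(7 / 2 == 3 && 7 % 2 == 1);   /* positive operands truncate toward 0 */

    assert(identity_holds(7, 2));
    assert(identity_holds(-7, 2));
    assert(identity_holds(7, -2));
    assert(identity_holds(-7, -2));
}
```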

The binary % operator yields the remainder from the division of the first expression by the second. 
The usual arithmetic conversions are performed. The operands must not be float. 

7.4 Additive operators 

The additive operators + and - group left-to-right. The usual arithmetic conversions are performed. 
There are some additional type possibilities for each operator. 

additive-expression: 

expression + expression 
expression - expression 

The result of the + operator is the sum of the operands. A pointer to an object in an array and a value of
any integral type may be added. The latter is in all cases converted to an address offset by multiplying it
by the length of the object to which the pointer points. The result is a pointer of the same type as the
original pointer, and which points to another object in the same array, appropriately offset from the original
object. Thus if P is a pointer to an object in an array, the expression P+1 is a pointer to the next
object in the array.

No further type combinations are allowed for pointers.

The + operator is associative and expressions with several additions at the same level may be rearranged
by the compiler.

The result of the - operator is the difference of the operands. The usual arithmetic conversions are 
performed. Additionally, a value of any integral type may be subtracted from a pointer, and then the 
same conversions as for addition apply. 

If two pointers to objects of the same type are subtracted, the result is converted (by division by the
length of the object) to an int representing the number of objects separating the pointed-to objects.
This conversion will in general give unexpected results unless the pointers point to objects in the same
array, since pointers, even to objects of the same type, do not necessarily differ by a multiple of the
object-length.
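
The scaling applied in pointer addition and subtraction can be sketched as follows, assuming a present-day compiler. The array a and the name check_pointer_arithmetic are hypothetical.

```c
#include <assert.h>

void check_pointer_arithmetic(void)
{
    static long a[5];
    long *p = &a[1];
    long *q = &a[4];

    assert(p + 1 == &a[2]);     /* P+1 points to the next object in the array */
    assert(q - 2 == &a[2]);     /* subtraction of an integer is scaled the same way */
    assert(q - p == 3);         /* pointer difference counts objects, not bytes */

    /* Measured in bytes, the same distance is three object-lengths. */
    assert((char *)q - (char *)p == 3 * (long)sizeof(long));
}
```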

7.5 Shift operators 

The shift operators << and >> group left-to-right. Both perform the usual arithmetic conversions on
their operands, each of which must be integral. Then the right operand is converted to int; the type of
the result is that of the left operand. The result is undefined if the right operand is negative, or greater
than or equal to the length of the object in bits.

shift-expression:
        expression << expression
        expression >> expression

The value of E1<<E2 is E1 (interpreted as a bit pattern) left-shifted E2 bits; vacated bits are 0-filled.
The value of E1>>E2 is E1 right-shifted E2 bit positions. The right shift is guaranteed to be logical (0-fill)
if E1 is unsigned; otherwise it may be (and is, on the PDP-11) arithmetic (fill by a copy of the sign
bit).
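
Only the unsigned cases are fully determined, so the sketch below is restricted to them. It assumes a present-day compiler and the UINT_MAX macro from <limits.h>; check_shifts is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>

void check_shifts(void)
{
    unsigned u = 1;

    assert((u << 3) == 8);                      /* vacated low bits are 0-filled */
    assert((0x80u >> 4) == 0x08u);

    /* Right shift of an unsigned value is logical: the top bit becomes 0. */
    assert((UINT_MAX >> 1) == UINT_MAX / 2);
}
```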

7.6 Relational operators 

The relational operators group left-to-right, but this fact is not very useful; a<b<c does not mean
what it seems to.

relational-expression:
        expression < expression
        expression > expression
        expression <= expression
        expression >= expression

The operators < (less than), > (greater than), <= (less than or equal to) and >= (greater than or equal to)
all yield 0 if the specified relation is false and 1 if it is true. The type of the result is int. The usual
arithmetic conversions are performed. Two pointers may be compared; the result depends on the relative 
locations in the address space of the pointed-to objects. Pointer comparison is portable only when the 
pointers point to objects in the same array. 
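
The remark about a<b<c can be made concrete. In the sketch below, chained spells out the left-to-right grouping with explicit parentheses (present-day compilers warn about the unparenthesized form), and between is the test that was probably intended; both names, and check_relational, are hypothetical.

```c
#include <assert.h>

/* Same grouping as a<b<c: the 0 or 1 from a<b is compared with c. */
static int chained(int a, int b, int c)
{
    return (a < b) < c;
}

/* The mathematical meaning: b lies strictly between a and c. */
static int between(int a, int b, int c)
{
    return a < b && b < c;
}

void check_relational(void)
{
    assert(chained(3, 2, 1) == 1);   /* 3<2 is 0, and 0<1 is 1 */
    assert(between(3, 2, 1) == 0);

    assert(chained(1, 2, 1) == 0);   /* 1<2 is 1, and 1<1 is 0 */
    assert(between(1, 2, 3) == 1);
}
```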

7.7 Equality operators 

equality-expression:
        expression == expression
        expression != expression

The == (equal to) and the != (not equal to) operators are exactly analogous to the relational operators
except for their lower precedence. (Thus a<b == c<d is 1 whenever a<b and c<d have the same
truth-value).

A pointer may be compared to an integer, but the result is machine dependent unless the integer is 
the constant 0. A pointer to which 0 has been assigned is guaranteed not to point to any object, and will 
appear to be equal to 0; in conventional usage, such a pointer is considered to be null. 

7.8 Bitwise AND operator 

and-expression:
        expression & expression

The & operator is associative and expressions involving & may be rearranged. The usual arithmetic
conversions are performed; the result is the bitwise AND function of the operands. The operator applies
only to integral operands.

7.9 Bitwise exclusive OR operator 

exclusive-or-expression:
        expression ^ expression

The ^ operator is associative and expressions involving ^ may be rearranged. The usual arithmetic
conversions are performed; the result is the bitwise exclusive OR function of the operands. The operator
applies only to integral operands.

7.10 Bitwise inclusive OR operator

inclusive-or-expression:
        expression | expression

The | operator is associative and expressions involving | may be rearranged. The usual arithmetic
conversions are performed; the result is the bitwise inclusive OR function of its operands. The operator
applies only to integral operands.

7.11 Logical AND operator 

logical-and-expression:
        expression && expression

The && operator groups left-to-right. It returns 1 if both its operands are non-zero, 0 otherwise. Unlike
&, && guarantees left-to-right evaluation; moreover the second operand is not evaluated if the first
operand is 0.

The operands need not have the same type, but each must have one of the fundamental types or be
a pointer. The result is always int.

7.12 Logical OR operator 

logical-or-expression:
        expression || expression

The || operator groups left-to-right. It returns 1 if either of its operands is non-zero, and 0 otherwise.
Unlike |, || guarantees left-to-right evaluation; moreover, the second operand is not evaluated if the
value of the first operand is non-zero.

The operands need not have the same type, but each must have one of the fundamental types or be 
a pointer. The result is always int. 
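
The skipping of the second operand by && and || can be observed by counting calls, as in this sketch. It assumes a present-day compiler; the counter calls and the functions touch and check_short_circuit are hypothetical.

```c
#include <assert.h>

static int calls;            /* number of times touch() has run */

/* Records that it was evaluated, then returns its argument. */
static int touch(int v)
{
    calls++;
    return v;
}

void check_short_circuit(void)
{
    int zero = 0, one = 1;
    int r;

    calls = 0;

    r = zero && touch(1);    /* first operand is 0: touch is not called */
    assert(r == 0 && calls == 0);

    r = one || touch(0);     /* first operand is non-zero: touch is not called */
    assert(r == 1 && calls == 0);

    r = one && touch(5);     /* evaluated; the result is 1, not 5 */
    assert(r == 1 && calls == 1);

    r = zero || touch(0);    /* evaluated; both operands are 0 */
    assert(r == 0 && calls == 2);
}
```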

7.13 Conditional operator 

conditional-expression: 

expression ? expression : expression 

Conditional expressions group right-to-left. The first expression is evaluated and if it is non-zero, the
result is the value of the second expression, otherwise that of the third expression. If possible, the usual
arithmetic conversions are performed to bring the second and third expressions to a common type; otherwise,
if both are pointers of the same type, the result has the common type; otherwise, one must be a
pointer and the other the constant 0, and the result has the type of the pointer. Only one of the second
and third expressions is evaluated.

7.14 Assignment operators 

There are a number of assignment operators, all of which group right-to-left. All require an lvalue as 
their left operand, and the type of an assignment expression is that of its left operand. The value is the 
value stored in the left operand after the assignment has taken place. The two parts of a compound 
assignment operator are separate tokens. 

assignment-expression:
        lvalue = expression
        lvalue += expression
        lvalue -= expression
        lvalue *= expression
        lvalue /= expression
        lvalue %= expression
        lvalue >>= expression
        lvalue <<= expression
        lvalue &= expression
        lvalue ^= expression
        lvalue |= expression

In the simple assignment with =, the value of the expression replaces that of the object referred to by
the lvalue. If both operands have arithmetic type, the right operand is converted to the type of the left
preparatory to the assignment.

The behavior of an expression of the form E1 op= E2 may be inferred by taking it as equivalent to
E1 = E1 op (E2); however, E1 is evaluated only once. In += and -=, the left operand may be a
pointer, in which case the (integral) right operand is converted as explained in §7.4; all right operands
and all non-pointer left operands must have arithmetic type.
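
The single evaluation of E1 and the implied parentheses around E2 can be illustrated as follows. This is only a sketch: the function names are hypothetical, and ANSI-style prototypes are used for the benefit of current compilers.

```c
/* Hypothetical names; a sketch of the rules above. */
static int index_calls;

static int next_index(void) { return index_calls++; }

/* Returns how many times the subscript was evaluated by
   a[next_index()] += 5 and stores the updated element in *out. */
int compound_once(int *out)
{
    int a[2];

    a[0] = 1;
    a[1] = 1;
    index_calls = 0;
    a[next_index()] += 5;       /* E1 is evaluated only once */
    *out = a[0];
    return index_calls;
}

/* x *= 2 + 3 means x = x * (2 + 3), not x = x * 2 + 3. */
int times(int x)
{
    x *= 2 + 3;
    return x;
}

/* += on a pointer advances it by whole objects (§7.4). */
int pointer_step(void)
{
    int v[4];
    int *p = v;

    p += 3;
    return (int)(p - v);
}
```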

The compilers currently allow a pointer to be assigned to an integer, an integer to a pointer, and a 
pointer to a pointer of another type. The assignment is a pure copy operation, with no conversion. This 
usage is nonportable, and may produce pointers which cause addressing exceptions when used. However, 
it is guaranteed that assignment of the constant 0 to a pointer will produce a null pointer distinguishable 
from a pointer to any object. 

7.15 Comma operator 

comma-expression: 

expression , expression 

A pair of expressions separated by a comma is evaluated left-to-right and the value of the left expression
is discarded. The type and value of the result are the type and value of the right operand. This operator
groups left-to-right. In contexts where comma is given a special meaning, for example in a list of actual
arguments to functions (§7.1) and lists of initializers (§8.6), the comma operator as described in this
section can only appear in parentheses; for example,

f(a, (t=3, t+2), c)

has three arguments, the second of which has the value 5. 
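
The call above can be tried with a hypothetical three-argument function; second_arg and comma_demo are illustrative names, not part of the manual.

```c
/* Hypothetical function standing in for f: returns its second argument. */
int second_arg(int a, int b, int c)
{
    (void)a;
    (void)c;
    return b;
}

int comma_demo(void)
{
    int t;

    /* The parenthesized comma expression is a single argument:
       t is set to 3, then t+2 is the value passed. */
    return second_arg(1, (t = 3, t + 2), 7);
}
```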

8. Declarations 

Declarations are used to specify the interpretation which C gives to each identifier: they do not 
necessarily reserve storage associated with the identifier. Declarations have the form 

declaration: 

decl-specifiers declarator-list opt ;

The declarators in the declarator-list contain the identifiers being declared. The decl-specifiers consist of a
sequence of type and storage class specifiers.

decl-specifiers:

type-specifier decl-specifiers opt
sc-specifier decl-specifiers opt

The list must be self-consistent in a way described below. 

8.1 Storage class specifiers 
The sc-specifiers are: 

sc-specifier: 

auto 

static 

extern 

register 

typedef 

The typedef specifier does not reserve storage and is called a "storage class specifier" only for syntactic
convenience; it is discussed in §8.8. The meanings of the various storage classes were discussed in §4.

The auto, static, and register declarations also serve as definitions in that they cause an 
appropriate amount of storage to be reserved. In the extern case there must be an external definition 
(§10) for the given identifiers somewhere outside the function in which they are declared. 

A register declaration is best thought of as an auto declaration, together with a hint to the compiler
that the variables declared will be heavily used. Only the first few such declarations are effective.
Moreover, only variables of certain types will be stored in registers; on the PDP-11, they are int, char,
or pointer. One other restriction applies to register variables: the address-of operator & cannot be applied
to them. Smaller, faster programs can be expected if register declarations are used appropriately, but
future improvements in code generation may render them unnecessary.

At most one sc-specifier may be given in a declaration. If the sc-specifier is missing from a declaration,
it is taken to be auto inside a function, extern outside. Exception: functions are never automatic.

8.2 Type specifiers 

The type-specifiers are 

type-specifier: 

char 

short 

int 

long 

unsigned 

float 

double 

struct-or-union-specifier 

typedef-name 

The words long, short, and unsigned may be thought of as adjectives: the following combinations are 
acceptable. 

short int 
long int 
unsigned int 
long float 

The meaning of the last is the same as double. Otherwise, at most one type-specifier may be given in a 
declaration. If the type-specifier is missing from a declaration, it is taken to be int. 

Specifiers for structures and unions are discussed in §8.5; declarations with typedef names are
discussed in §8.8.

8.3 Declarators 

The declarator-list appearing in a declaration is a comma-separated sequence of declarators, each of 
which may have an initializer. 

declarator-list: 

init-declarator 

init-declarator , declarator-list 

init-declarator: 

declarator initializer opt

Initializers are discussed in §8.6. The specifiers in the declaration indicate the type and storage class of
the objects to which the declarators refer. Declarators have the syntax: 

declarator: 

identifier 
( declarator ) 

* declarator 
declarator () 

declarator [ constant-expression opt ]

The grouping is the same as in expressions. 

8.4 Meaning of declarators 

Each declarator is taken to be an assertion that when a construction of the same form as the declarator
appears in an expression, it yields an object of the indicated type and storage class. Each declarator
contains exactly one identifier; it is this identifier that is declared.

If an unadorned identifier appears as a declarator, then it has the type indicated by the specifier
heading the declaration.

A declarator in parentheses is identical to the unadorned declarator, but the binding of complex 
declarators may be altered by parentheses. See the examples below. 

Now imagine a declaration

T D1

where T is a type-specifier (like int, etc.) and D1 is a declarator. Suppose this declaration makes the
identifier have type "... T," where the "..." is empty if D1 is just a plain identifier (so that the type of
x in "int x" is just int). Then if D1 has the form

*D

the type of the contained identifier is "... pointer to T."

If D1 has the form

D()

then the contained identifier has the type "... function returning T."

If D1 has the form

D[constant-expression]

or

D[]

then the contained identifier has type "... array of T." In the first case the constant expression is an
expression whose value is determinable at compile time, and whose type is int. (Constant expressions
are defined precisely in §15.) When several "array of" specifications are adjacent, a multi-dimensional
array is created; the constant expressions which specify the bounds of the arrays may be missing only for
the first member of the sequence. This elision is useful when the array is external and the actual
definition, which allocates storage, is given elsewhere. The first constant-expression may also be omitted
when the declarator is followed by initialization. In this case the size is calculated from the number of
initial elements supplied.

An array may be constructed from one of the basic types, from a pointer, from a structure or union, 
or from another array (to generate a multi-dimensional array). 

Not all the possibilities allowed by the syntax above are actually permitted. The restrictions are as 
follows: functions may not return arrays, structures, unions or functions, although they may return 
pointers to such things; there are no arrays of functions, although there may be arrays of pointers to 
functions. Likewise a structure or union may not contain a function, but it may contain a pointer to a 
function. 

As an example, the declaration 

int i, *ip, f(), *fip(), (*pfi)();

declares an integer i, a pointer ip to an integer, a function f returning an integer, a function fip
returning a pointer to an integer, and a pointer pfi to a function which returns an integer. It is
especially useful to compare the last two. The binding of *fip() is *(fip()), so that the declaration
suggests, and the same construction in an expression requires, the calling of a function fip, and then using
indirection through the (pointer) result to yield an integer. In the declarator (*pfi)(), the extra
parentheses are necessary, as they are also in an expression, to indicate that indirection through a pointer
to a function yields a function, which is then called; it returns an integer.
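
The contrast between fip and pfi can be made concrete with a short sketch. The definitions of cell and seven, and the wrappers via_fip and via_pfi, are hypothetical; the (void) parameter lists are a modern convenience in place of the empty parentheses this manual uses.

```c
static int cell = 42;               /* hypothetical object for fip to point at */

int *fip(void) { return &cell; }    /* function returning pointer to int */

int seven(void) { return 7; }       /* hypothetical function for pfi to point at */

int (*pfi)(void) = seven;           /* pointer to function returning int */

/* *fip() binds as *(fip()): call first, then indirect through the result. */
int via_fip(void) { return *fip(); }

/* (*pfi)(): indirect through the pointer first, then call the function. */
int via_pfi(void) { return (*pfi)(); }
```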

As another example,

float fa[17], *afp[17];

declares an array of float numbers and an array of pointers to float numbers. Finally,

static int x3d[3][5][7];

declares a static three-dimensional array of integers, with rank 3×5×7. In complete detail, x3d is an
array of three items; each item is an array of five arrays; each of the latter arrays is an array of seven
integers. Any of the expressions x3d, x3d[i], x3d[i][j], x3d[i][j][k] may reasonably appear in
an expression. The first three have type "array," the last has type int.
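
The nesting of x3d can be confirmed with sizeof. The function names below are hypothetical and the prototypes are a modern convenience.

```c
static int x3d[3][5][7];

/* x3d is an array of three items ... */
int items(void)    { return (int)(sizeof(x3d) / sizeof(x3d[0])); }

/* ... each item is an array of five arrays ... */
int per_item(void) { return (int)(sizeof(x3d[0]) / sizeof(x3d[0][0])); }

/* ... each of which is an array of seven integers. */
int per_row(void)  { return (int)(sizeof(x3d[0][0]) / sizeof(x3d[0][0][0])); }

/* Total number of int elements. */
int total(void)    { return (int)(sizeof(x3d) / sizeof(int)); }
```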

8.5 Structure and union declarations

A structure is an object consisting of a sequence of named members. Each member may have any
type. A union is an object which may, at a given time, contain any one of several members. Structure
and union specifiers have the same form.

struct-or-union-specifier:

struct-or-union { struct-decl-list }
struct-or-union identifier { struct-decl-list }
struct-or-union identifier

struct-or-union:

struct

union

The struct-decl-list is a sequence of declarations for the members of the structure or union:

struct-decl-list:

struct-declaration
struct-declaration struct-decl-list

struct-declaration:

type-specifier struct-declarator-list ;

struct-declarator-list:

struct-declarator

struct-declarator , struct-declarator-list

In the usual case, a struct-declarator is just a declarator for a member of a structure or union. A structure
member may also consist of a specified number of bits. Such a member is also called a field; its
length is set off from the field name by a colon.

struct-declarator: 
declarator 

declarator : constant-expression 
: constant-expression 

Within a structure, the objects declared have addresses which increase as their declarations are read
left-to-right. Each non-field member of a structure begins on an addressing boundary appropriate to its type;
therefore, there may be unnamed holes in a structure. Field members are packed into machine integers;
they do not straddle words. A field which does not fit into the space remaining in a word is put into the
next word. No field may be wider than a word. Fields are assigned right-to-left on the PDP-11,
left-to-right on other machines.

A struct-declarator with no declarator, only a colon and a width, indicates an unnamed field useful 
for padding to conform to externally-imposed layouts. As a special case, an unnamed field with a width 
of 0 specifies alignment of the next field at a word boundary. The “next field" presumably is a field, not 
an ordinary structure member, because in the latter case the alignment would have been automatic. 

The language does not restrict the types of things that are declared as fields, but implementations are 
not required to support any but integer fields. Moreover, even int fields may be considered to be 
unsigned. On the PDP-11, fields are not signed and have only integer values. In all implementations, 
there are no arrays of fields, and the address-of operator & may not be applied to them, so that there are 
no pointers to fields. 
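
A minimal sketch of field declarations follows. The structure flags and the function store_mode are hypothetical; the layout of fields within a word is implementation-dependent, so the example relies only on the width of an unsigned field.

```c
/* Hypothetical structure illustrating named, unnamed, and zero-width fields. */
struct flags {
    unsigned ready : 1;
    unsigned mode  : 3;
    unsigned       : 0;     /* unnamed, width 0: next field starts a new word */
    unsigned count : 4;
};

/* Stores v in the 3-bit field and returns what the field then holds. */
unsigned store_mode(unsigned v)
{
    struct flags f;

    f.ready = 1;
    f.mode = v;             /* only the low 3 bits of v are kept */
    f.count = 9;
    return f.mode;
}
```

An expression such as &f.mode would be rejected, since the address-of operator may not be applied to a field.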

A union may be thought of as a structure all of whose members begin at offset 0 and whose size is
sufficient to contain any of its members. At most one of the members can be stored in a union at any
time.

A structure or union specifier of the second form, that is, one of 

struct identifier { struct-decl-list }
union identifier { struct-decl-list }

declares the identifier to be the structure tag (or union tag) of the structure specified by the list. A
subsequent declaration may then use the third form of specifier, one of

struct identifier 
union identifier 

Structure tags allow definition of self-referential structures; they also permit the long part of the
declaration to be given once and used several times. It is illegal to declare a structure or union which contains
an instance of itself, but a structure or union may contain a pointer to an instance of itself.

The names of members and tags may be the same as ordinary variables. However, names of tags
and members must be mutually distinct.

Two structures may share a common initial sequence of members: that is, the same member may 
appear in two different structures if it has the same type in both and if all previous members are the same 
in both. (Actually, the compiler checks only that a name in two different structures has the same type 
and offset in both, but if preceding members differ the construction is nonportable.) 

A simple example of a structure declaration is 

struct tnode {
    char tword[20];
    int count;
    struct tnode *left;
    struct tnode *right;
};

which contains an array of 20 characters, an integer, and two pointers to similar structures. Once this 
declaration has been given, the declaration 

struct tnode s, *sp;

declares s to be a structure of the given sort and sp to be a pointer to a structure of the given sort. With
these declarations, the expression

sp->count

refers to the count field of the structure to which sp points;

s.left

refers to the left subtree pointer of the structure s; and

s.right->tword[0]

refers to the first character of the tword member of the right subtree of s. 
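
These member references can be exercised in a small sketch. The function first_of_right and the object leaf are hypothetical, and the word "zebra" is arbitrary sample data.

```c
#include <string.h>

struct tnode {
    char tword[20];
    int count;
    struct tnode *left;
    struct tnode *right;
};

/* Hypothetical function: builds a two-node tree and reads it back. */
char first_of_right(void)
{
    static struct tnode leaf;       /* static storage starts out as 0 */
    struct tnode s, *sp;

    strcpy(leaf.tword, "zebra");
    s.count = 0;
    s.left = 0;
    s.right = &leaf;

    sp = &s;
    sp->count = 3;                  /* same member as s.count */

    return s.count == 3 ? s.right->tword[0] : '?';
}
```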

8.6 Initialization 

A declarator may specify an initial value for the identifier being declared. The initializer is preceded
by =, and consists of an expression or a list of values nested in braces.

initializer:

= expression

= { initializer-list }

= { initializer-list , }

initializer-list:

expression

initializer-list , initializer-list
{ initializer-list }

All the expressions in an initializer for a static or external variable must be constant expressions, 
which are described in §15, or expressions which reduce to the address of a previously declared variable, 
possibly offset by a constant expression. Automatic or register variables may be initialized by arbitrary 
expressions involving constants, and previously declared variables and functions. 

Static and external variables which are not initialized are guaranteed to start off as 0; automatic and
register variables which are not initialized are guaranteed to start off as garbage.

When an initializer applies to a scalar (a pointer or an object of arithmetic type), it consists of a
single expression, perhaps in braces. The initial value of the object is taken from the expression; the same
conversions as for assignment are performed.

When the declared variable is an aggregate (a structure or array) then the initializer consists of a
brace-enclosed, comma-separated list of initializers for the members of the aggregate, written in
increasing subscript or member order. If the aggregate contains subaggregates, this rule applies recursively to
the members of the aggregate. If there are fewer initializers in the list than there are members of the
aggregate, then the aggregate is padded with 0's. It is not permitted to initialize unions or automatic
aggregates.

Braces may be elided as follows. If the initializer begins with a left brace, then the succeeding
comma-separated list of initializers initializes the members of the aggregate; it is erroneous for there to
be more initializers than members. If, however, the initializer does not begin with a left brace, then only
enough elements from the list are taken to account for the members of the aggregate; any remaining
members are left to initialize the next member of the aggregate of which the current aggregate is a part.

A final abbreviation allows a char array to be initialized by a string. In this case successive charac¬ 
ters of the string initialize the members of the array. 

For example, 

int x[] = { 1, 3, 5 };

declares and initializes x as a 1-dimensional array which has three members, since no size was specified 
and there are three initializers. 

float y[4][3] = {
    { 1, 3, 5 },
    { 2, 4, 6 },
    { 3, 5, 7 },
};

is a completely-bracketed initialization: 1, 3, and 5 initialize the first row of the array y[0], namely
y[0][0], y[0][1], and y[0][2]. Likewise the next two lines initialize y[1] and y[2]. The initializer
ends early and therefore y[3] is initialized with 0. Precisely the same effect could have been
achieved by

float y[4][3] = {
    1, 3, 5, 2, 4, 6, 3, 5, 7
};

The initializer for y begins with a left brace, but that for y[0] does not, therefore 3 elements from the
list are used. Likewise the next three are taken successively for y[1] and y[2]. Also,

float y[4][3] = {
    { 1 }, { 2 }, { 3 }, { 4 }
};

initializes the first column of y (regarded as a two-dimensional array) and leaves the rest 0.

Finally,

char msg[] = "Syntax error on line %s\n";

shows a character array whose members are initialized with a string.
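
The equivalence of the bracketed and unbracketed forms can be checked directly. In this sketch the three arrays are given distinct hypothetical names (y_full, y_flat, y_col) so that they can coexist in one file; the checking functions are likewise illustrative.

```c
#include <string.h>

/* Hypothetical names for the three forms of y shown above. */
static float y_full[4][3] = { { 1, 3, 5 }, { 2, 4, 6 }, { 3, 5, 7 } };
static float y_flat[4][3] = { 1, 3, 5, 2, 4, 6, 3, 5, 7 };
static float y_col[4][3]  = { { 1 }, { 2 }, { 3 }, { 4 } };

static int x[] = { 1, 3, 5 };
static char msg[] = "Syntax error on line %s\n";

/* Returns 1 if the bracketed and flat initializers gave identical arrays. */
int same_rows(void)
{
    int i, j;

    for (i = 0; i < 4; i++)
        for (j = 0; j < 3; j++)
            if (y_full[i][j] != y_flat[i][j])
                return 0;
    return 1;
}

float full_at(int r, int c) { return y_full[r][c]; }
float col_at(int r, int c)  { return y_col[r][c]; }

/* The size of x is taken from the number of initializers. */
int x_members(void) { return (int)(sizeof(x) / sizeof(x[0])); }

/* The char array holds every character of the string plus the terminator. */
int msg_fits_exactly(void) { return sizeof(msg) == strlen(msg) + 1; }
```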

8.7 Type names 

In two contexts (to specify type conversions explicitly by means of a cast, and as an argument of 
sizeof) it is desired to supply the name of a data type. This is accomplished using a “type name," 
which in essence is a declaration for an object of that type which omits the name of the object. 

type-name: 

type-specifier abstract-declarator 

abstract-declarator: 

empty 

( abstract-declarator ) 

* abstract-declarator 
abstract-declarator () 

abstract-declarator [ constant-expression opt ]

To avoid ambiguity, in the construction 

( abstract-declarator ) 

the abstract-declarator is required to be non-empty. Under this restriction, it is possible to identify
uniquely the location in the abstract-declarator where the identifier would appear if the construction were
a declarator in a declaration. The named type is then the same as the type of the hypothetical identifier.
For example,

int
int *
int *[3]
int (*)[3]
int *()
int (*)()

name respectively the types “integer," “pointer to integer," “array of 3 pointers to integers," “pointer 
to an array of 3 integers," “function returning pointer to integer," and “pointer to function returning an 
integer." 
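
A brief sketch of type names in their two contexts, casts and sizeof, follows. The function names are hypothetical.

```c
/* Cast with the type name "int": the fractional part is discarded. */
int as_int(double d) { return (int) d; }

/* sizeof with "int *[3]": an array of 3 pointers to int. */
int ptr_array_members(void)
{
    return (int)(sizeof(int *[3]) / sizeof(int *));
}

/* "int (*)[3]" is a pointer to an array of 3 ints; the object it would
   point to therefore has the size of 3 ints.  The operand of sizeof is
   not evaluated, so the null pointer is never dereferenced. */
int pointee_members(void)
{
    return (int)(sizeof(*(int (*)[3])0) / sizeof(int));
}
```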

8.8 Typedef 

Declarations whose “storage class" is typedef do not define storage, but instead define identifiers 
which can be used later as if they were type keywords naming fundamental or derived types. 

typedef-name: 

identifier 

Within the scope of a declaration involving typedef, each identifier appearing as part of any declarator
therein becomes syntactically equivalent to the type keyword naming the type associated with the identifier
in the way described in §8.4. For example, after

typedef int MILES, *KLICKSP;

typedef struct { double re, im; } complex;

the constructions 

MILES distance; 
extern KLICKSP metricp; 
complex z, *zp; 

are all legal declarations; the type of distance is int, that of metricp is "pointer to int," and that of
z is the specified structure; zp is a pointer to such a structure.

typedef does not introduce brand new types, only synonyms for types which could be specified in 
another way. Thus in the example above distance is considered to have exactly the same type as any 
other int object. 
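
That a typedef name is only a synonym can be seen in a short sketch. The variable plain and the function typedef_demo are hypothetical; the typedefs are those of the example above.

```c
typedef int MILES, *KLICKSP;
typedef struct { double re, im; } complex;

MILES distance = 26;
int plain = 0;                      /* hypothetical ordinary int */

int typedef_demo(void)
{
    KLICKSP metricp = &plain;       /* KLICKSP is "pointer to int" */
    int *ip = &distance;            /* MILES is int, so no cast is needed */
    complex z, *zp = &z;

    zp->re = 1.5;
    zp->im = -2.0;
    *metricp = *ip + 16;            /* plain = 26 + 16 */
    return plain + (z.im == -2.0 ? 0 : 1);
}
```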

9. Statements 

Except as indicated, statements are executed in sequence. 

9.1 Expression statement 

Most statements are expression statements, which have the form 
expression ; 

Usually expression statements are assignments or function calls. 

9.2 Compound statement, or block 

So that several statements can be used where one is expected, the compound statement (also, and 
equivalently, called “block") is provided: 

compound-statement: 

{ declaration-list opt statement-list opt }

declaration-list: 

declaration 

declaration declaration-list 

statement-list: 

statement 

statement statement-list 

If any of the identifiers in the declaration-list were previously declared, the outer declaration is pushed
down for the duration of the block, after which it resumes its force.

Any initializations of auto or register variables are performed each time the block is entered at
the top. It is currently possible (but a bad practice) to transfer into a block; in that case the initializations
are not performed. Initializations of static variables are performed only once when the program begins
execution. Inside a block, extern declarations do not reserve storage so initialization is not permitted.
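
Both rules, the hiding of outer declarations and the timing of initializations, can be sketched as follows. The function names are hypothetical.

```c
/* The inner x hides the outer one only for the duration of the block. */
int shadow_demo(void)
{
    int x = 1;

    {
        int x = 2;          /* outer x is pushed down here */
        x++;
    }
    return x;               /* outer x resumes its force */
}

/* Returns fresh * 100 + kept after the block has been entered three times. */
int entries(void)
{
    int i, last = 0;

    for (i = 0; i < 3; i++) {
        int fresh = 10;         /* initialized on every entry to the block */
        static int kept = 10;   /* initialized only once */

        fresh++;
        kept++;
        last = fresh * 100 + kept;
    }
    return last;
}
```

On the first call to entries, fresh is 11 on each pass while kept has climbed to 13; because kept is static, later calls continue from its stored value.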

9.3 Conditional statement 

The two forms of the conditional statement are 

if ( expression ) statement 

if ( expression ) statement else statement 

In both cases the expression is evaluated and if it is non-zero, the first substatement is executed. In the
second case the second substatement is executed if the expression is 0. As usual the "else" ambiguity is
resolved by connecting an else with the last encountered else-less if.
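
The resolution of the else ambiguity can be sketched with a hypothetical function classify; the indentation shows the pairing the compiler actually uses.

```c
/* Hypothetical function: the else belongs to the inner if. */
int classify(int a, int b)
{
    int r = 0;

    if (a)
        if (b)
            r = 1;
        else            /* pairs with if (b), not with if (a) */
            r = 2;
    return r;
}
```

When a is 0 neither assignment runs and r stays 0, which would not be the case if the else were attached to the outer if.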

9.4 While statement 

The while statement has the form 

while ( expression ) statement 

The substatement is executed repeatedly so long as the value of the expression remains non-zero. The
test takes place before each execution of the statement.

9.5 Do statement 

The do statement has the form 

do statement while ( expression ) ; 

The substatement is executed repeatedly until the value of the expression becomes zero. The test takes 
place after each execution of the statement. 

9.6 For statement 

The for statement has the form 

for ( expression-1 opt ; expression-2 opt ; expression-3 opt ) statement

This statement is equivalent to

expression-1 ;
while ( expression-2 ) {
    statement
    expression-3 ;
}

Thus the first expression specifies initialization for the loop; the second specifies a test, made before each
iteration, such that the loop is exited when the expression becomes 0; the third expression often specifies
an incrementation which is performed after each iteration.

Any or all of the expressions may be dropped. A missing expression-2 makes the implied while 
clause equivalent to while (1); other missing expressions are simply dropped from the expansion above. 
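
The equivalence can be sketched by writing the same loop both ways. The function names are hypothetical; first_multiple assumes k is non-zero.

```c
/* The for form ... */
int sum_for(int n)
{
    int i, s = 0;

    for (i = 1; i <= n; i++)
        s += i;
    return s;
}

/* ... and its expansion into a while statement. */
int sum_while(int n)
{
    int i, s = 0;

    i = 1;
    while (i <= n) {
        s += i;
        i++;
    }
    return s;
}

/* All three expressions dropped: the test is taken to be while (1). */
int first_multiple(int k, int min)
{
    int i = min;

    for (;;) {
        if (i % k == 0)
            return i;
        i++;
    }
}
```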

9.7 Switch statement 

The switch statement causes control to be transferred to one of several statements depending on 
the value of an expression. It has the form 

switch ( expression ) statement 

The usual arithmetic conversion is performed on the expression, but the result must be int. The
statement is typically compound. Any statement within the statement may be labeled with one or more case
prefixes as follows:

case constant-expression : 

where the constant expression must be int. No two of the case constants in the same switch may have 
the same value. Constant expressions are precisely defined in §15. 

There may also be at most one statement prefix of the form

default :

When the switch statement is executed, its expression is evaluated and compared with each case
constant. If one of the case constants is equal to the value of the expression, control is passed to the
statement following the matched case prefix. If no case constant matches the expression, and if there is a
default prefix, control passes to the prefixed statement. If no case matches and if there is no default
then none of the statements in the switch is executed.

case and default prefixes in themselves do not alter the flow of control, which continues
unimpeded across such prefixes. To exit from a switch, see break, §9.8.

Usually the statement that is the subject of a switch is compound. Declarations may appear at the 
head of this statement, but initializations of automatic or register variables are ineffective. 
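
The flow of control across case prefixes can be sketched as follows. The function names fall and no_default are hypothetical.

```c
/* Control continues across the case 2 prefix when n is 1. */
int fall(int n)
{
    int r = 0;

    switch (n) {
    case 1:
        r += 1;         /* no break: execution continues into case 2 */
    case 2:
        r += 10;
        break;          /* leaves the switch (§9.8) */
    default:
        r = -1;
    }
    return r;
}

/* With no match and no default, nothing in the switch is executed. */
int no_default(int n)
{
    int r = 0;

    switch (n) {
    case 5:
        r = 5;
    }
    return r;
}
```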

9.8 Break statement

The statement

break ;

causes termination of the smallest enclosing while, do, for, or switch statement; control passes to the
statement following the terminated statement.

9.9 Continue statement

The statement

continue ;

causes control to pass to the loop-continuation portion of the smallest enclosing while, do, or for
statement; that is, to the end of the loop. More precisely, in each of the statements

while (...) {        do {                 for (...) {
    ...                  ...                  ...
contin: ;            contin: ;            contin: ;
}                    } while (...);       }

a continue is equivalent to goto contin. (Following the contin: is a null statement, §9.13.)
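
The stated equivalence can be sketched by writing one loop with continue and one with an explicit goto. The function names are hypothetical; the label contin follows the text above.

```c
/* Sums the odd integers below n, skipping even ones with continue. */
int sum_odd(int n)
{
    int i, s = 0;

    for (i = 0; i < n; i++) {
        if (i % 2 == 0)
            continue;       /* to the end of the body; i++ still runs */
        s += i;
    }
    return s;
}

/* The same loop with the continue written as goto contin. */
int sum_odd_goto(int n)
{
    int i, s = 0;

    for (i = 0; i < n; i++) {
        if (i % 2 == 0)
            goto contin;
        s += i;
    contin: ;               /* label carried by a null statement */
    }
    return s;
}
```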

9.10 Return statement 

A function returns to its caller by means of the return statement, which has one of the forms 

return ; 
return expression ; 

In the first case the returned value is undefined. In the second case, the value of the expression is
returned to the caller of the function. If required, the expression is converted, as if by assignment, to the
type of the function in which it appears. Flowing off the end of a function is equivalent to a return with
no returned value.

9.11 Goto statement 

Control may be transferred unconditionally by means of the statement 
goto identifier ; 

The identifier must be a label (§9.12) located in the current function. 

9.12 Labeled statement 

Any statement may be preceded by label prefixes of the form 
identifier : 

which serve to declare the identifier as a label. The only use of a label is as a target of a goto. The
scope of a label is the current function, excluding any sub-blocks in which the same identifier has been
redeclared. See §11.

9.13 Null statement

The null statement has the form

;

A null statement is useful to carry a label just before the } of a compound statement or to supply a null
body to a looping statement such as while.
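
A null loop body can be sketched with a hypothetical string-length function, in which all the work is done by the controlling expression.

```c
/* Hypothetical function: counts the characters before the terminating 0. */
int length(char *s)
{
    char *p = s;

    while (*p++)
        ;                   /* null statement as the loop body */

    /* p has been advanced one position past the terminator. */
    return (int)(p - s) - 1;
}
```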

10. External definitions 

A C program consists of a sequence of external definitions. An external definition declares an
identifier to have storage class extern (by default) or perhaps static, and a specified type. The
type-specifier (§8.2) may also be empty, in which case the type is taken to be int. The scope of external
definitions persists to the end of the file in which they are declared just as the effect of declarations
persists to the end of a block. The syntax of external definitions is the same as that of all declarations,
except that only at this level may the code for functions be given.

10.1 External function definitions

Function definitions have the form

function-definition:

decl-specifiers opt function-declarator function-body


The only sc-specifiers allowed among the decl-specifiers are extern or static; see §11.2 for the
distinction between them. A function declarator is similar to a declarator for a "function returning ..." except
that it lists the formal parameters of the function being defined.



function-declarator:

declarator ( parameter-list opt )

parameter-list:

identifier

identifier , parameter-list

The function-body has the form

function-body:

declaration-list compound-statement


The identifiers in the parameter list, and only those identifiers, may be declared in the declaration list. 
Any identifiers whose type is not given are taken to be int. The only storage class which may be 
specified is register; if it is specified, the corresponding actual parameter will be copied, if possible, 
into a register at the outset of the function. 

A simple example of a complete function definition is

int max(a, b, c)
int a, b, c;
{
    int m;

    m = (a > b) ? a : b;
    return((m > c) ? m : c);
}

Here int is the type-specifier; max(a, b, c) is the function-declarator; int a, b, c; is the
declaration-list for the formal parameters; { ... } is the block giving the code for the statement.
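
For readers trying this with a current compiler, the same function is sketched below with the parameter types written inside the parentheses. That prototype form is not part of the language described in this manual, and the name max3 is hypothetical; recent compilers may reject the old-style declaration-list shown above.

```c
/* Same logic as max above, written in the later prototype style
   (an assumption about the reader's compiler, not this manual's syntax). */
int max3(int a, int b, int c)
{
    int m;

    m = (a > b) ? a : b;
    return (m > c) ? m : c;
}
```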

C converts all float actual parameters to double, so formal parameters declared float have their 
declaration adjusted to read double. Also, since a reference to an array in any context (in particular as 
an actual parameter) is taken to mean a pointer to the first element of the array, declarations of formal 
parameters declared "array of ..." are adjusted to read "pointer to ...". Finally, because structures,
unions and functions cannot be passed to a function, it is useless to declare a formal parameter to be a
structure, union or function (pointers to such objects are of course permitted).

10.2 External data definitions

An external data definition has the form

data-definition:

declaration 

The storage class of such data may be extern (which is the default) or static, but not auto or 
register. 

11. Scope rules 

A C program need not all be compiled at the same time: the source text of the program may be kept
in several files, and precompiled routines may be loaded from libraries. Communication among the
functions of a program may be carried out both through explicit calls and through manipulation of external
data.

Therefore, there are two kinds of scope to consider: first, what may be called the lexical scope of an
identifier, which is essentially the region of a program during which it may be used without drawing
"undefined identifier" diagnostics; and second, the scope associated with external identifiers, which is
characterized by the rule that references to the same external identifier are references to the same object.

11.1 Lexical scope 

The lexical scope of identifiers declared in external definitions persists from the definition through 
the end of the source file in which they appear. The lexical scope of identifiers which are formal
parameters persists through the function with which they are associated. The lexical scope of identifiers declared
at the head of blocks persists until the end of the block. The lexical scope of labels is the whole of the 
function in which they appear. 

Because all references to the same external identifier refer to the same object (see §11.2) the
compiler checks all declarations of the same external identifier for compatibility; in effect their scope is
increased to the whole file in which they appear.

In all cases, however, if an identifier is explicitly declared at the head of a block, including the block 
constituting a function, any declaration of that identifier outside the block is suspended until the end of 
the block. 

Remember also (§8.5) that identifiers associated with ordinary variables on the one hand and those
associated with structure and union members and tags on the other form two disjoint classes which do
not conflict. Members and tags follow the same scope rules as other identifiers. typedef names are in
the same class as ordinary identifiers. They may be redeclared in inner blocks, but an explicit type must
be given in the inner declaration:

typedef float distance;
...
{
    auto int distance;
    ...

The int must be present in the second declaration, or it would be taken to be a declaration with no
declarators and type distance.†

11.2 Scope of externals 

If a function refers to an identifier declared to be extern, then somewhere among the files or 
libraries constituting the complete program there must be an external definition for the identifier. All 
functions in a given program which refer to the same external identifier refer to the same object, so care 
must be taken that the type and size specified in the definition are compatible with those specified by each 
function which references the data. 

The appearance of the extern keyword in an external definition indicates that storage for the 
identifiers being declared will be allocated in another file. Thus in a multi-file program, an external data 
definition without the extern specifier must appear in exactly one of the files. Any other files which 
wish to give an external definition for the identifier must include the extern in the definition. The 
identifier can be initialized only in the declaration where storage is allocated. 

Identifiers declared static at the top level in external definitions are not visible in other files. 
Functions may be declared static. 

†It is agreed that the ice is thin here. 







12. Compiler control lines 

The C compiler contains a preprocessor capable of macro substitution, conditional compilation, and 
inclusion of named files. Lines beginning with # communicate with this preprocessor. These lines have 
syntax independent of the rest of the language: they may appear anywhere and have effect which lasts 
(independent of scope) until the end of the source program file. 

12.1 Token replacement 

A compiler-control line of the form 

#define identifier token-string 

(note: no trailing semicolon) causes the preprocessor to replace subsequent instances of the identifier with 
the given string of tokens. A line of the form 

#define identifier( identifier , ... , identifier ) token-string 

where there is no space between the first identifier and the (, is a macro definition with arguments. 
Subsequent instances of the first identifier followed by a (, a sequence of tokens delimited by commas, and a 
) are replaced by the token string in the definition. Each occurrence of an identifier mentioned in the 
formal parameter list of the definition is replaced by the corresponding token string from the call. The 
actual arguments in the call are token strings separated by commas; however, commas in quoted strings or 
protected by parentheses do not separate arguments. The number of formal and actual parameters must 
be the same. Text inside a string or a character constant is not subject to replacement. 

In both forms the replacement string is rescanned for more defined identifiers. In both forms a long 
definition may be continued on another line by writing \ at the end of the line to be continued. 

This facility is most valuable for definition of “manifest constants,” as in 

#define TABSIZE 100 

int table[TABSIZE]; 

A control line of the form 

#undef identifier 

causes the identifier's preprocessor definition to be forgotten. 
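
The two forms of #define, line continuation, and #undef can be seen together in the following sketch. The macro names MAX and SQUARE_SUM and the helper functions are illustrative assumptions, not part of the manual.

```c
#define TABSIZE 100                          /* manifest constant: no trailing semicolon */
#define MAX(a, b) ((a) > (b) ? (a) : (b))    /* no space between MAX and ( */
#define SQUARE_SUM(x, y) \
    ((x) * (x) + (y) * (y))                  /* definition continued with \ */

int table[TABSIZE];

int table_len(void)          { return (int) (sizeof table / sizeof table[0]); }
int max_of(int a, int b)     { return MAX(a, b); }
int square_sum(int x, int y) { return SQUARE_SUM(x, y); }

#undef TABSIZE                               /* later uses of TABSIZE are not replaced */
```

The parentheses around each parameter in the replacement text are a defensive habit: the arguments are substituted as token strings, so an argument such as a + 1 would otherwise regroup with neighboring operators.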

12.2 File inclusion 

A compiler control line of the form 

#include "filename" 

causes the replacement of that line by the entire contents of the file filename. The named file is searched 
for first in the directory of the original source file, and then in a sequence of standard places. 
Alternatively, a control line of the form 

#include <filename> 

searches only the standard places, and not the directory of the source file. 
#include's may be nested. 

12.3 Conditional compilation 

A compiler control line of the form 

#if constant-expression 

checks whether the constant expression (see §15) evaluates to non-zero. A control line of the form 

#ifdef identifier 

checks whether the identifier is currently defined in the preprocessor; that is, whether it has been the 
subject of a #define control line. A control line of the form 

#ifndef identifier 

checks whether the identifier is currently undefined in the preprocessor. 

All three forms are followed by an arbitrary number of lines, possibly containing a control line 

#else 

and then by a control line 

#endif 

If the checked condition is true then any lines between #else and #endif are ignored. If the checked 
condition is false then any lines between the test and an #else or, lacking an #else, the #endif, are 
ignored. 

These constructions may be nested. 
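
A short sketch of the three tests follows. The identifiers DEBUG, LOG_LEVEL and BUFSIZE_HINT are hypothetical names used only to exercise each form.

```c
#define DEBUG 1

#ifdef DEBUG                     /* true: DEBUG has been the subject of a #define */
#define LOG_LEVEL 2
#else
#define LOG_LEVEL 0              /* ignored, because the condition was true */
#endif

#ifndef BUFSIZE_HINT             /* true: BUFSIZE_HINT is not yet defined */
#define BUFSIZE_HINT 512
#endif

#if LOG_LEVEL > 1                /* constant expression evaluates to non-zero */
static const char *mode = "verbose";
#else
static const char *mode = "quiet";
#endif

const char *current_mode(void) { return mode; }
int buffer_hint(void)          { return BUFSIZE_HINT; }
```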

12.4 Line control 

For the benefit of other preprocessors which generate C programs, a line of the form 
#line constant identifier 

causes the compiler to believe, for purposes of error diagnostics, that the line number of the next source 
line is given by the constant and the current input file is named by the identifier. If the identifier is 
absent the remembered file name does not change. 
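
The effect of a #line control line can be observed through the predefined __LINE__ and __FILE__ macros of present-day compilers, which postdate this manual. The sketch below also assumes the current form of the directive, in which the file name is written as a quoted string rather than a bare identifier; the name generated.y is hypothetical.

```c
/* The line after the #line directive is numbered 100 in file "generated.y". */
int reported_line(void)
{
#line 100 "generated.y"
    return __LINE__;
}

const char *reported_file(void) { return __FILE__; }
```

A program generator such as Yacc emits directives of this kind so that diagnostics point at the user's grammar file instead of the generated C text.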

13. Implicit declarations 

It is not always necessary to specify both the storage class and the type of identifiers in a declaration. 
The storage class is supplied by the context in external definitions and in declarations of formal 
parameters and structure members. In a declaration inside a function, if a storage class but no type is given, the 
identifier is assumed to be int; if a type but no storage class is indicated, the identifier is assumed to be 
auto. An exception to the latter rule is made for functions, since auto functions are meaningless (C 
being incapable of compiling code into the stack); if the type of an identifier is “function returning ...”, it 
is implicitly declared to be extern. 

In an expression, an identifier followed by ( and not already declared is contextually declared to be 
“function returning int”. 

14. Types revisited 

This section summarizes the operations which can be performed on objects of certain types. 

14.1 Structures and unions 

There are only two things that can be done with a structure or union: name one of its members (by 
means of the . operator); or take its address (by unary &). Other operations, such as assigning from or 
to it or passing it as a parameter, draw an error message. In the future, it is expected that these 
operations, but not necessarily others, will be allowed. 

§7.1 says that in a direct or indirect structure reference (with . or ->) the name on the right must 
be a member of the structure named or pointed to by the expression on the left. To allow an escape 
from the typing rules, this restriction is not firmly enforced by the compiler. In fact, any lvalue is allowed 
before ., and that lvalue is then assumed to have the form of the structure of which the name on the 
right is a member. Also, the expression before a -> is required only to be a pointer or an integer. If a 
pointer, it is assumed to point to a structure of which the name on the right is a member. If an integer, 
it is taken to be the absolute address, in machine storage units, of the appropriate structure. 

Such constructions are non-portable. 

14.2 Functions 

There are only two things that can be done with a function: call it, or take its address. If the name 
of a function appears in an expression not in the function-name position of a call, a pointer to the 
function is generated. Thus, to pass one function to another, one might say 

int f(); 
... 
g(f); 

Then the definition of g might read 

g(funcp) 
int (*funcp)(); 
{ 
    ... 
    (*funcp)(); 
    ... 
} 

Notice that f must be declared explicitly in the calling routine since its appearance in g(f) was not 
followed by (. 
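
The same idea written with present-day prototypes might look like the sketch below. The functions twice, apply and demo_apply are hypothetical; the parameter types in the pointer declaration are a later addition to the language and are not part of the syntax described in this manual.

```c
int twice(int x) { return 2 * x; }

/* apply calls whatever function funcp points to. */
int apply(int (*funcp)(int), int arg)
{
    return (*funcp)(arg);
}

int demo_apply(void)
{
    return apply(twice, 21);   /* twice is not followed by (, so a pointer is passed */
}
```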

14.3 Arrays, pointers, and subscripting 

Every time an identifier of array type appears in an expression, it is converted into a pointer to the 
first member of the array. Because of this conversion, arrays are not lvalues. By definition, the subscript 
operator [] is interpreted in such a way that E1[E2] is identical to *((E1)+(E2)). Because of the 
conversion rules which apply to +, if E1 is an array and E2 an integer, then E1[E2] refers to the E2-th 
member of E1. Therefore, despite its asymmetric appearance, subscripting is a commutative operation. 

A consistent rule is followed in the case of multi-dimensional arrays. If E is an n-dimensional array 
of rank i×j× ... ×k, then E appearing in an expression is converted to a pointer to an (n-1)-dimensional 
array with rank j× ... ×k. If the * operator, either explicitly or implicitly as a result of 
subscripting, is applied to this pointer, the result is the pointed-to (n-1)-dimensional array, which itself 
is immediately converted into a pointer. 

For example, consider 

int x[3][5]; 

Here x is a 3×5 array of integers. When x appears in an expression, it is converted to a pointer to (the 
first of three) 5-membered arrays of integers. In the expression x[i], which is equivalent to *(x+i), x 
is first converted to a pointer as described; then i is converted to the type of x, which involves 
multiplying i by the length of the object to which the pointer points, namely 5 integer objects. The results are 
added and indirection applied to yield an array (of 5 integers) which in turn is converted to a pointer to 
the first of the integers. If there is another subscript the same argument applies again; this time the 
result is an integer. 

It follows from all this that arrays in C are stored row-wise (last subscript varies fastest) and that the 
first subscript in the declaration helps determine the amount of storage consumed by an array but plays 
no other part in subscript calculations. 
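
These identities can be checked directly. The sketch below uses hypothetical helper names (commutes, row_stride, same_element) and measures the row spacing in bytes so that the comparison stays within the single object x.

```c
#include <stddef.h>

int x[3][5];

/* E1[E2] is *((E1)+(E2)), so the operands of [] may be exchanged. */
int commutes(void)
{
    int a[4] = {10, 20, 30, 40};
    return a[2] == 2[a] && a[2] == *(a + 2);
}

/* Number of ints between the starts of consecutive rows of x. */
size_t row_stride(void)
{
    return (size_t) ((char *) x[1] - (char *) x[0]) / sizeof(int);
}

/* x[i][j] names the same element as *(*(x + i) + j). */
int same_element(int i, int j)
{
    return &x[i][j] == *(x + i) + j;
}
```

The stride of 5 depends only on the second dimension, which is why the first subscript in the declaration affects storage size but not the address calculation.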

14.4 Explicit pointer conversions 

Certain conversions involving pointers are permitted but have implementation-dependent aspects. 
They are all specified by means of an explicit type-conversion operator, §§7.2 and 8.7. 

A pointer may be converted to any of the integral types large enough to hold it. Whether an int or 
long is required is machine dependent. The mapping function is also machine dependent, but is 
intended to be unsurprising to those who know the addressing structure of the machine. Details for 
some particular machines are given below. 

An object of integral type may be explicitly converted to a pointer. The mapping always carries an 
integer converted from a pointer back to the same pointer, but is otherwise machine dependent. 

A pointer to one type may be converted to a pointer to another type. The resulting pointer may 
cause addressing exceptions upon use if the subject pointer does not refer to an object suitably aligned in 
storage. It is guaranteed that a pointer to an object of a given size may be converted to a pointer to an 
object of a smaller size and back again without change. 

For example, a storage-allocation routine might accept a size (in bytes) of an object to allocate, and 
return a char pointer; it might be used in this way. 

extern char *alloc(); 
double *dp; 

dp = (double *) alloc(sizeof(double)); 

*dp = 22.0 / 7.0; 

alloc must ensure (in a machine-dependent way) that its return value is suitable for conversion to a 
pointer to double; then the use of the function is portable. 
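
On current systems the standard library function malloc plays the role of the alloc routine in the example; it returns storage suitably aligned for any object type. A minimal sketch, with the hypothetical wrapper name new_double:

```c
#include <stdlib.h>

/* Allocates one double, stores value in it, and returns the pointer
   (or NULL if no storage is available). The caller frees the result. */
double *new_double(double value)
{
    double *dp = (double *) malloc(sizeof(double));
    if (dp != NULL)
        *dp = value;
    return dp;
}
```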

The pointer representation on the PDP-11 corresponds to a 16-bit integer and is measured in bytes. 
chars have no alignment requirements; everything else must have an even address. 

On the Honeywell 6000, a pointer corresponds to a 36-bit integer; the word part is in the left 18 bits, 
and the two bits that select the character in a word just to their right. Thus char pointers are measured 
in units of 2^16 bytes; everything else is measured in units of 2^18 machine words. double quantities and 
aggregates containing them must lie on an even word address (0 mod 2^19). 

The IBM 370 and the Interdata 8/32 are similar. On both, addresses are measured in bytes; elementary 
objects must be aligned on a boundary equal to their length, so pointers to short must be 0 mod 2, 
to int and float 0 mod 4, and to double 0 mod 8. Aggregates are aligned on the strictest boundary 
required by any of their constituents. 

15. Constant expressions 

In several places C requires expressions which evaluate to a constant: after case, as array bounds, 
and in initializers. In the first two cases, the expression can involve only integer constants, character 
constants, and sizeof expressions, possibly connected by the binary operators 

+ - * / % & | ^ << >> == != < > <= >= 

or by the unary operators 

- ~ 

or by the ternary operator 

?: 

Parentheses can be used for grouping, but not for function calls. 

More latitude is permitted for initializers; besides constant expressions as discussed above, one can 
also apply the unary & operator to external or static objects, and to external or static arrays subscripted 
with a constant expression. The unary & can also be applied implicitly by appearance of unsubscripted 
arrays and functions. The basic rule is that initializers must evaluate either to a constant or to the 
address of a previously declared external or static object plus or minus a constant. 
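
The following sketch shows constant expressions in each of the three positions. All names (NROWS, grid, third, fifth, classify, bound, gap) are hypothetical.

```c
#define NROWS 4

static int grid[NROWS * 2 + 1];    /* array bound: a constant expression */
static int *third = &grid[2];      /* & applied to a static array with a constant subscript */
static int *fifth = grid + 4;      /* implicit address of the array, plus a constant */

int classify(int n)
{
    switch (n) {
    case (sizeof(char) == 1) ? 1 : 0:   /* sizeof, ==, and ?: in a case label */
        return 10;
    case 'c' - 'a':                     /* character constants joined by a binary operator */
        return 20;
    default:
        return 0;
    }
}

int bound(void) { return (int) (sizeof grid / sizeof grid[0]); }
int gap(void)   { return (int) (fifth - third); }
```

A function call in any of these positions would be rejected, since the value must be known before the program runs.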

16. Portability considerations 

Certain parts of C are inherently machine dependent. The following list of potential trouble spots is 
not meant to be all-inclusive, but to point out the main ones. 

Purely hardware issues like word size and the properties of floating point arithmetic and integer 
division have proven in practice to be not much of a problem. Other facets of the hardware are reflected in 
differing implementations. Some of these, particularly sign extension (converting a negative character 
into a negative integer) and the order in which bytes are placed in a word, are a nuisance that must be 
carefully watched. Most of the others are only minor problems. 

The number of register variables that can actually be placed in registers varies from machine to 
machine, as does the set of valid types. Nonetheless, the compilers all do things properly for their own 
machine; excess or invalid register declarations are ignored. 

Some difficulties arise only when dubious coding practices are used. It is exceedingly unwise to write 
programs that depend on any of these properties. 

The order of evaluation of function arguments is not specified by the language. It is right to left on 
the PDP-11 and VAX-11, left to right on the others. The order in which side effects take place is also 
unspecified. 

Since character constants are really objects of type int, multi-character character constants may be 
permitted. The specific implementation is very machine dependent, however, because the order in which 
characters are assigned to a word varies from one machine to another. 

Fields are assigned to words and characters to integers right-to-left on the PDP-11 and VAX-11 and 
left-to-right on other machines. These differences are invisible to isolated programs which do not indulge 
in type punning (for example, by converting an int pointer to a char pointer and inspecting the 
pointed-to storage), but must be accounted for when conforming to externally-imposed storage layouts. 
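
The kind of type punning mentioned above is also the usual way to discover a machine's byte order at run time. The sketch below is illustrative only; the function names are hypothetical, and the second function shows one common way to avoid depending on sign extension of char.

```c
/* Returns 1 when the low-order byte of an int is stored at the lowest
   address (as on the PDP-11 and VAX-11), and 0 otherwise. */
int low_byte_first(void)
{
    unsigned int word = 1;
    unsigned char *bytes = (unsigned char *) &word;   /* inspect the pointed-to storage */
    return bytes[0] == 1;
}

/* Converts a char to a non-negative int whether or not the machine
   sign-extends characters. */
int char_value(char c)
{
    return (unsigned char) c;
}
```

Code that reads or writes externally defined layouts can test low_byte_first once and reorder bytes accordingly, instead of assuming one machine's convention.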

The language accepted by the various compilers differs in minor details. Most notably, the current 
PDP-11 compiler will not initialize structures containing bit-fields, and does not accept a few assignment 
operators in certain contexts where the value of the assignment is used. 

17. Anachronisms 

Since C is an evolving language, certain obsolete constructions may be found in older programs. 
Although most versions of the compiler support such anachronisms, ultimately they will disappear, 
leaving only a portability problem behind. 

Earlier versions of C used the form =op instead of op= for assignment operators. This leads to 
ambiguities, typified by 

x=-1 

which actually decrements x since the = and the - are adjacent, but which might easily be intended to 
assign -1 to x. 

The syntax of initializers has changed: previously, the equals sign that introduces an initializer was 
not present, so instead of 

int x = 1; 

one used 

int x 1; 

The change was made because the initialization 

int f (1+2) 

resembles a function declaration closely enough to confuse the compilers. 

18. Syntax Summary 

This summary of C syntax is intended more for aiding comprehension than as an exact statement of 
the language. 

18.1 Expressions 

The basic expressions are: 

expression: 
    primary 
    * expression 
    & expression 
    - expression 
    ! expression 
    ~ expression 
    ++ lvalue 
    -- lvalue 
    lvalue ++ 
    lvalue -- 
    sizeof expression 
    ( type-name ) expression 
    expression binop expression 
    expression ? expression : expression 
    lvalue asgnop expression 
    expression , expression 

primary: 
    identifier 
    constant 
    string 
    ( expression ) 
    primary ( expression-list_opt ) 
    primary [ expression ] 
    lvalue . identifier 
    primary -> identifier 

lvalue: 
    identifier 
    primary [ expression ] 
    lvalue . identifier 
    primary -> identifier 
    * expression 
    ( lvalue ) 

The primary-expression operators 

() [] . -> 

have highest priority and group left-to-right. The unary operators 

* & - ! ~ ++ -- sizeof ( type-name ) 

have priority below the primary operators but higher than any binary operator, and group right-to-left. 
Binary operators group left-to-right; they have priority decreasing as indicated below. The conditional 
operator groups right to left. 

binop: 
    * / % 
    + - 
    >> << 
    < > <= >= 
    == != 
    & 
    ^ 
    | 
    && 
    || 
    ?: 

Assignment operators all have the same priority, and all group right-to-left. 

asgnop: 
    = += -= *= /= %= >>= <<= &= ^= |= 

The comma operator has the lowest priority, and groups left-to-right. 
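
A few of these grouping rules can be observed in a short sketch; the function names are hypothetical and each comment states the grouping the compiler applies.

```c
int left_to_right(void)  { return 10 - 4 - 3; }        /* (10 - 4) - 3 */
int mul_before_add(void) { return 2 + 3 * 4; }         /* 2 + (3 * 4) */

int chained_assign(void)
{
    int a, b;
    a = b = 7;                                         /* a = (b = 7) */
    return a + b;
}

int sign_of(int n)
{
    return n < 0 ? -1 : n > 0 ? 1 : 0;                 /* n < 0 ? -1 : (n > 0 ? 1 : 0) */
}

int unary_grouping(void)
{
    int v[2] = {5, 9};
    int *p = v;
    int first = *p++;                                  /* *(p++): fetch v[0], then advance p */
    return first * 10 + *p;
}
```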

18.2 Declarations 

declaration: 
    decl-specifiers init-declarator-list_opt ; 

decl-specifiers: 
    type-specifier decl-specifiers_opt 
    sc-specifier decl-specifiers_opt 

sc-specifier: 
    auto 
    static 
    extern 
    register 
    typedef 

type-specifier: 
    char 
    short 
    int 
    long 
    unsigned 
    float 
    double 
    struct-or-union-specifier 
    typedef-name 

init-declarator-list: 
    init-declarator 
    init-declarator , init-declarator-list 

init-declarator: 
    declarator initializer_opt 

declarator: 
    identifier 
    ( declarator ) 
    * declarator 
    declarator () 
    declarator [ constant-expression_opt ] 





struct-or-union-specifier: 
    struct { struct-decl-list } 
    struct identifier { struct-decl-list } 
    struct identifier 
    union { struct-decl-list } 
    union identifier { struct-decl-list } 
    union identifier 

struct-decl-list: 
    struct-declaration 
    struct-declaration struct-decl-list 

struct-declaration: 
    type-specifier struct-declarator-list ; 

struct-declarator-list: 
    struct-declarator 
    struct-declarator , struct-declarator-list 

struct-declarator: 
    declarator 
    declarator : constant-expression 
    : constant-expression 

initializer: 
    = expression 
    = { initializer-list } 
    = { initializer-list , } 

initializer-list: 
    expression 
    initializer-list , initializer-list 
    { initializer-list } 

type-name: 
    type-specifier abstract-declarator 

abstract-declarator: 
    empty 
    ( abstract-declarator ) 
    * abstract-declarator 
    abstract-declarator () 
    abstract-declarator [ constant-expression_opt ] 

typedef-name: 
    identifier 


18.3 Statements 

compound-statement: 
    { declaration-list_opt statement-list_opt } 

declaration-list: 
    declaration 
    declaration declaration-list 

statement-list: 
    statement 
    statement statement-list 

statement: 
    compound-statement 
    expression ; 
    if ( expression ) statement 
    if ( expression ) statement else statement 
    while ( expression ) statement 
    do statement while ( expression ) ; 
    for ( expression-1_opt ; expression-2_opt ; expression-3_opt ) statement 
    switch ( expression ) statement 
    case constant-expression : statement 
    default : statement 
    break ; 
    continue ; 
    return ; 
    return expression ; 
    goto identifier ; 
    identifier : statement 


18.4 External definitions 

program: 
    external-definition 
    external-definition program 

external-definition: 
    function-definition 
    data-definition 

function-definition: 
    type-specifier_opt function-declarator function-body 

function-declarator: 
    declarator ( parameter-list_opt ) 

parameter-list: 
    identifier 
    identifier , parameter-list 

function-body: 
    type-decl-list function-statement 

function-statement: 
    { declaration-list_opt statement-list } 

data-definition: 
    extern_opt type-specifier_opt init-declarator-list_opt ; 
    static_opt type-specifier_opt init-declarator-list_opt ; 


18.5 Preprocessor 

#define identifier token-string 
#define identifier( identifier , ... , identifier ) token-string 
#undef identifier 
#include "filename" 
#include <filename> 
#if constant-expression 
#ifdef identifier 
#ifndef identifier 
#else 
#endif 
#line constant identifier 

Recent Changes to C 

November 15, 1978 


A few extensions have been made to the C language beyond what is described in the reference 
document (“The C Programming Language,” Kernighan and Ritchie, Prentice-Hall, 1978). 

1. Structure assignment 

Structures may be assigned, passed as arguments to functions, and returned by functions. The types 
of operands taking part must be the same. Other plausible operators, such as equality comparison, have 
not been implemented. 

There is a subtle defect in the PDP-11 implementation of functions that return structures: if an 
interrupt occurs during the return sequence, and the same function is called reentrantly during the interrupt, 
the value returned from the first call may be corrupted. The problem can occur only in the presence of 
true interrupts, as in an operating system or a user program that makes significant use of signals; ordinary 
recursive calls are quite safe. 
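
A minimal sketch of the three new operations follows. The structure point and the functions shifted and demo_struct are hypothetical names chosen for illustration.

```c
struct point { int x, y; };

/* The argument is passed by value and the result is returned by value. */
struct point shifted(struct point p, int dx)
{
    p.x += dx;                  /* changes the local copy only */
    return p;
}

int demo_struct(void)
{
    struct point a = {1, 2}, b;
    b = a;                      /* structure assignment copies every member */
    b = shifted(b, 10);
    return a.x * 100 + b.x;     /* a is unchanged by the call */
}
```

Because equality comparison is not provided for structures, two values of type struct point have to be compared member by member.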

2. Enumeration type 

There is a new data type analogous to the scalar types of Pascal. To the type-specifiers in the syntax 
on p. 193 of the C book add 

enum-specifier 

with syntax 

enum-specifier: 
    enum { enum-list } 
    enum identifier { enum-list } 
    enum identifier 

enum-list: 
    enumerator 
    enum-list , enumerator 

enumerator: 
    identifier 
    identifier = constant-expression 

The role of the identifier in the enum-specifier is entirely analogous to that of the structure tag in a 
struct-specifier; it names a particular enumeration. For example, 

enum color { chartreuse, burgundy, claret, winedark }; 
... 
enum color *cp, col; 

makes color the enumeration-tag of a type describing various colors, and then declares cp as a pointer 
to an object of that type, and col as an object of that type. 

The identifiers in the enum-list are declared as constants, and may appear wherever constants are 
required. If no enumerators with = appear, then the values of the constants begin at 0 and increase by 1 
as the declaration is read from left to right. An enumerator with = gives the associated identifier the 
value indicated; subsequent identifiers continue the progression from the assigned value. 

Enumeration tags and constants must all be distinct, and, unlike structure tags and members, are 
drawn from the same set as ordinary identifiers. 

Objects of a given enumeration type are regarded as having a type distinct from objects of all other 
types, and lint flags type mismatches. In the PDP-11 implementation all enumeration variables are treated 
as if they were int. 
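
The numbering rule can be illustrated with a small sketch. The enumeration level and its constants are hypothetical and were chosen so that one enumerator carries an explicit value.

```c
enum level { low, mid, high = 10, peak };   /* low = 0, mid = 1, high = 10, peak = 11 */

int counts[peak + 1];                       /* an enumeration constant used as an array bound */

int level_value(enum level v) { return (int) v; }

int slots(void) { return (int) (sizeof counts / sizeof counts[0]); }
```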

A Tour Through the Portable C Compiler 


S. C. Johnson 
Bell Laboratories 
Murray Hill, New Jersey 07974 


Introduction 

A C compiler has been implemented that has proved to be quite portable, serving as the 
basis for C compilers on roughly a dozen machines, including the Honeywell 6000, IBM 370, 
and Interdata 8/32. The compiler is highly compatible with the C language standard. 1 

Among the goals of this compiler are portability, high reliability, and the use of 
state-of-the-art techniques and tools wherever practical. Although the efficiency of the compiling 
process is not a primary goal, the compiler is efficient enough, and produces good enough code, to 
serve as a production compiler. 

The language implemented is highly compatible with the current PDP-11 version of C. 
Moreover, roughly 75% of the compiler, including nearly all the syntactic and semantic 
routines, is machine independent. The compiler also serves as the major portion of the program 
lint, described elsewhere. 2 

A number of earlier attempts to make portable compilers are worth noting. While on 
CO-OP assignment to Bell Labs in 1973, Alan Snyder wrote a portable C compiler which was 
the basis of his Master’s Thesis at M.I.T. 3 This compiler was very slow and complicated, and 
contained a number of rather serious implementation difficulties; nevertheless, a number of 
Snyder’s ideas appear in this work. 

Most earlier portable compilers, including Snyder’s, have proceeded by defining an 
intermediate language, perhaps based on three-address code or code for a stack machine, and 
writing a machine independent program to translate from the source code to this intermediate 
code. The intermediate code is then read by a second pass, and interpreted or compiled. This 
approach is elegant, and has a number of advantages, especially if the target machine is far 
removed from the host. It suffers from some disadvantages as well. Some constructions, like 
initialization and subroutine prologs, are difficult or expensive to express in a machine 
independent way that still allows them to be easily adapted to the target assemblers. Most of 
these approaches require a symbol table to be constructed in the second (machine dependent) 
pass, and/or require powerful target assemblers. Also, many conversion operators may be 
generated that have no effect on a given machine, but may be needed on others (for example, 
pointer to pointer conversions usually do nothing in C, but must be generated because there 
are some machines where they are significant). 

For these reasons, the first pass of the portable compiler is not entirely machine 
independent. It contains some machine dependent features, such as initialization, subroutine prolog 
and epilog, certain storage allocation functions, code for the switch statement, and code to 
throw out unneeded conversion operators. 

As a crude measure of the degree of portability actually achieved, the Interdata 8/32 C 
compiler has roughly 600 machine dependent lines of source out of 4600 in Pass 1, and 1000 
out of 3400 in Pass 2. In total, 1600 out of 8000, or 20%, of the total source is machine 
dependent (12% in Pass 1, 30% in Pass 2). These percentages can be expected to rise slightly as the 
compiler is tuned. The percentage of machine-dependent code for the IBM is 22%, for the 
Honeywell 25%. If the assembler format and structure were the same for all these machines, 
perhaps another 5-10% of the code would become machine independent. 

These figures are sufficiently misleading as to be almost meaningless. A large fraction of 
the machine dependent code can be converted in a straightforward, almost mechanical way. 
On the other hand, a certain amount of the code requires hard intellectual effort to convert, 
since the algorithms embodied in this part of the code are typically complicated and machine 
dependent. 

To summarize, however, if you need a C compiler written for a machine with a 
reasonable architecture, the compiler is already three quarters finished! 

Overview 

This paper discusses the structure and organization of the portable compiler. The intent 
is to give the big picture, rather than discussing the details of a particular machine 
implementation. After a brief overview and a discussion of the source file structure, the paper describes 
the major data structures, and then delves more closely into the two passes. Some of the 
theoretical work on which the compiler is based, and its application to the compiler, is 
discussed elsewhere. 4 One of the major design issues in any C compiler, the design of the calling 
sequence and stack frame, is the subject of a separate memorandum. 5 

The compiler consists of two passes, pass1 and pass2, that together turn C source code 
into assembler code for the target machine. The two passes are preceded by a preprocessor 
that handles the #define and #include statements, and related features (e.g., #ifdef, etc.). It 
is a nearly machine independent program, and will not be further discussed here. 

The output of the preprocessor is a text file that is read as the standard input of the first 
pass. This produces as standard output another text file that becomes the standard input of 
the second pass. The second pass produces, as standard output, the desired assembler 
language source code. The preprocessor and the two passes all write error messages on the 
standard error file. Thus the compiler itself makes few demands on the I/O library support, 
aiding in the bootstrapping process. 

Although the compiler is divided into two passes, this represents historical accident more 
than deep necessity. In fact, the compiler can optionally be loaded so that both passes operate 
in the same program. This “one pass” operation eliminates the overhead of reading and 
writing the intermediate file, so the compiler operates about 30% faster in this mode. It also 
occupies about 30% more space than the larger of the two component passes. 

Because the compiler is fundamentally structured as two passes, even when loaded as one, 
this document primarily describes the two pass version. 

The first pass does the lexical analysis, parsing, and symbol table maintenance. It also 
constructs parse trees for expressions, and keeps track of the types of the nodes in these trees. 
Additional code is devoted to initialization. Machine dependent portions of the first pass serve 
to generate subroutine prologs and epilogs, code for switches, and code for branches, label 
definitions, alignment operations, changes of location counter, etc. 

The intermediate file is a text file organized into lines. Lines beginning with a right 
parenthesis are copied by the second pass directly to its output file, with the parenthesis 
stripped off. Thus, when the first pass produces assembly code, such as subroutine prologs, 
etc., each line is prefaced with a right parenthesis; the second pass passes these lines 
through to the assembler. 

The major job done by the second pass is generation of code for expressions. The 
expression parse trees produced in the first pass are written onto the intermediate file in Polish 
Prefix form: first, there is a line beginning with a period, followed by the source file line 
number and name on which the expression appeared (for debugging purposes). The successive 
lines represent the nodes of the parse tree, one node per line. Each line contains the node 
number, type, and any values (e.g., values of constants) that may appear in the node. Lines 
representing nodes with descendants are immediately followed by the left subtree of 
descendants, then the right. Since the number of descendants of any node is completely determined 
by the node number, there is no need to mark the end of the tree. 
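
The reason no end-of-tree marker is needed can be seen in a small sketch. Everything here is an assumption made for illustration: the node numbers, the arity function, the node layout, and the line format are hypothetical and are not taken from the compiler's actual source.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical node numbers; the real values are defined in the compiler's headers. */
enum { ICON = 1, NAME = 2, UMINUS = 10, PLUS = 20, MUL = 21 };

/* The node number alone fixes how many subtrees follow. */
int arity(int op)
{
    if (op >= 20) return 2;
    if (op >= 10) return 1;
    return 0;
}

struct node { int op; int value; struct node *left, *right; };

/* Appends the tree to out in prefix order, one node per line.
   The caller must supply a buffer large enough for the whole tree. */
void write_prefix(const struct node *t, char *out)
{
    char line[32];
    snprintf(line, sizeof line, "%d %d\n", t->op, t->value);
    strcat(out, line);
    if (arity(t->op) >= 1) write_prefix(t->left, out);
    if (arity(t->op) == 2) write_prefix(t->right, out);
}

/* Reads one complete tree starting at ops[pos] and returns the index
   just past it, using only the arity of each node to know when to stop. */
int skip_tree(const int *ops, int pos)
{
    int n = arity(ops[pos++]);
    while (n-- > 0)
        pos = skip_tree(ops, pos);
    return pos;
}
```

For the expression a + 3 * 4 the writer emits the PLUS node, then the NAME leaf, then the MUL node and its two constants; a reader that knows each arity consumes exactly those five lines and stops.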

There are only two other line types in the intermediate file. Lines beginning with a left 
square bracket (‘[’) represent the beginning of blocks (delimited by { ... } in the C source); lines 
beginning with right square brackets (‘]’) represent the end of blocks. The remainder of these 
lines tell how much stack space, and how many register variables, are currently in use. 

Thus, the second pass reads the intermediate files, copies the ‘)’ lines, makes note of the 
information in the ‘[’ and ‘]’ lines, and devotes most of its effort to the ‘.’ lines and their 
associated expression trees, turning them into assembly code to evaluate the expressions. 
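
The dispatch on the first character of each intermediate-file line might be sketched as follows. The enumeration, the function names, and the sample lines in the usage note are hypothetical; the real second pass is organized differently in detail.

```c
enum line_kind { COPY_LINE, TREE_START, BLOCK_BEGIN, BLOCK_END, TREE_NODE };

/* Classifies one line of the intermediate file by its first character. */
enum line_kind classify_line(const char *line)
{
    switch (line[0]) {
    case ')': return COPY_LINE;     /* assembly text from the first pass */
    case '.': return TREE_START;    /* source line number and file name */
    case '[': return BLOCK_BEGIN;   /* stack and register usage at block entry */
    case ']': return BLOCK_END;     /* stack and register usage at block exit */
    default:  return TREE_NODE;     /* one node of an expression tree */
    }
}

/* For a ')' line, the text passed to the assembler is everything after the marker. */
const char *assembler_text(const char *line)
{
    return line + 1;
}
```

Given a line such as ")mov r0,r1", classify_line reports COPY_LINE and assembler_text yields the text with the parenthesis stripped off.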

In the one pass version of the compiler, the expression trees that are built by the first 
pass have been declared to have room for the second pass information as well. Instead of 
writing the trees onto an intermediate file, each tree is transformed in place into an acceptable 
form for the code generator. The code generator then writes the result of compiling this tree 
onto the standard output. Instead of ‘[’ and ‘]’ lines in the intermediate file, the information is 
passed directly to the second pass routines. Assembly code produced by the first pass is simply 
written out, without the need for ‘)’ at the head of each line. 

The Source Files 

The compiler source consists of 22 source files. Two files, manifest and macdefs, are 
header files included with all other files. Manifest has declarations for the node numbers, 
types, storage classes, and other global data definitions. Macdefs has machine-dependent 
definitions, such as the size and alignment of the various data representations. Two machine 
independent header files, mfile1 and mfile2, contain the data structure and manifest definitions 
for the first and second passes, respectively. In the second pass, a machine dependent header 
file, mac2defs, contains declarations of register names, etc. 

There is a file, common, containing (machine independent) routines used in both passes.
These include routines for allocating and freeing trees, walking over trees, printing debugging
information, and printing error messages. There are two dummy files, comm1.c and comm2.c,
that simply include common within the scope of the appropriate pass1 or pass2 header files.
When the compiler is loaded as a single pass, common only needs to be included once: comm2.c
is not needed.

Entire sections of this document are devoted to the detailed structure of the passes. For 
the moment, we just give a brief description of the files. The first pass is obtained by
compiling and loading scan.c, cgram.c, xdefs.c, pftn.c, trees.c, optim.c, local.c, code.c, and
comm1.c. Scan.c is the lexical analyzer, which is used by cgram.c, the result of applying Yacc [6]
to the input grammar cgram.y. Xdefs.c is a short file of external definitions. Pftn.c maintains
the symbol table, and does initialization. Trees.c builds the expression trees, and computes the
node types. Optim.c does some machine independent optimizations on the expression trees.
Comm1.c includes common, which contains service routines common to the two passes of the
compiler. All the above files are machine independent. The files local.c and code.c contain
machine dependent code for generating subroutine prologs, switch code, and the like.

The second pass is produced by compiling and loading reader.c, allo.c, match.c, comm2.c,
order.c, local2.c, and table.c. Reader.c reads the intermediate file, and controls the major logic
of the code generation. Allo.c keeps track of busy and free registers. Match.c controls the
matching of code templates to subtrees of the expression tree to be compiled. Comm2.c includes
the file common, as in the first pass. The above files are machine independent. Order.c controls
the machine dependent details of the code generation strategy. Local2.c has many small
machine dependent routines, and tables of opcodes, register types, etc. Table.c has the code
template tables, which are also clearly machine dependent.

Data Structure Considerations. 

This section discusses the node numbers, type words, and expression trees, used 
throughout both passes of the compiler. 


The file manifest defines those symbols used throughout both passes. The intent is to 
use the same symbol name (e.g., MINUS) for the given operator throughout the lexical 
analysis, parsing, tree building, and code generation phases; this requires some synchronization 
with the Yacc input file, cgram.y, as well. 

A token like MINUS may be seen in the lexical analyzer before it is known whether it is 
a unary or binary operator; clearly, it is necessary to know this by the time the parse tree is 
constructed. Thus, an operator (really a macro) called UNARY is provided, so that MINUS 
and UNARY MINUS are both distinct node numbers. Similarly, many binary operators exist 
in an assignment form (for example, -=), and the operator ASG may be applied to such node
names to generate new ones, e.g. ASG MINUS. 

It is frequently desirable to know if a node represents a leaf (no descendants), a unary 
operator (one descendant) or a binary operator (two descendants). The macro optype(o) 
returns one of the manifest constants LTYPE, UTYPE, or BITYPE, respectively, depending 
on the node number o. Similarly, asgop(o) returns true if o is an assignment operator number 
(=, +=, etc.), and logop(o) returns true if o is a relational or logical (&&, ||, or !) operator.
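
One way such prefix operators and classification macros could be arranged is sketched below. It
is an illustration under stated assumptions: every numeric value, the flag names TYFLG, ASGFLG
and LOGFLG, and the table dope are assumptions made for this sketch; only the names UNARY,
ASG, MINUS, NAME, LTYPE, UTYPE, BITYPE, optype, asgop, and logop come from the text.

```c
/* Illustrative sketch only: every numeric value below is an assumption. */

/* Prefix macros: "ASG MINUS" expands to "1+ MINUS". */
#define ASG    1+
#define UNARY  2+

/* A few node numbers, spaced so that the +1 and +2 slots stay free. */
#define NAME    2
#define MINUS   8     /* ASG MINUS is then 9, UNARY MINUS is 10 */
#define ANDAND  23
#define NOT     26
#define ASSIGN  58
#define DSIZE   64

/* Shape of a node, plus property flags. */
#define LTYPE   02    /* leaf: no descendants          */
#define UTYPE   04    /* unary: one descendant         */
#define BITYPE  010   /* binary: two descendants       */
#define TYFLG   016   /* mask covering the three shapes */
#define ASGFLG  01    /* assignment operator           */
#define LOGFLG  020   /* relational or logical operator */

/* Hypothetical property table, indexed by node number. */
static const int dope[DSIZE] = {
    [NAME]        = LTYPE,
    [MINUS]       = BITYPE,
    [ASG MINUS]   = BITYPE | ASGFLG,
    [UNARY MINUS] = UTYPE,
    [ANDAND]      = BITYPE | LOGFLG,
    [NOT]         = UTYPE | LOGFLG,
    [ASSIGN]      = BITYPE | ASGFLG,
};

#define optype(o)  (dope[o] & TYFLG)
#define asgop(o)   (dope[o] & ASGFLG)
#define logop(o)   (dope[o] & LOGFLG)
```

With these definitions MINUS, ASG MINUS, and UNARY MINUS are three distinct node numbers,
and optype distinguishes the binary form from the unary one by a single table lookup.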

C has a rich typing structure, with a potentially infinite number of types. To begin with, 
there are the basic types: CHAR, SHORT, INT, LONG, the unsigned versions known as 
UCHAR, USHORT, UNSIGNED, ULONG, and FLOAT, DOUBLE, and finally STRTY (a 
structure), UNIONTY, and ENUMTY. Then, there are three operators that can be applied to 
types to make others: if t is a type, we may potentially have types pointer to t, function return¬ 
ing t, and array of t’s generated from t. Thus, an arbitrary type in C consists of a basic type, 
and zero or more of these operators. 

In the compiler, a type is represented by an unsigned integer; the rightmost four bits hold 
the basic type, and the remaining bits are divided into two-bit fields, containing 0 (no opera¬ 
tor), or one of the three operators described above. The modifiers are read right to left in the 
word, starting with the two-bit field adjacent to the basic type, until a field with 0 in it is 
reached. The macros PTR, FTN, and ARY represent the pointer to, function returning, and 
array of operators. The macro values are shifted so that they align with the first two-bit field; 
thus PTR+INT represents the type for an integer pointer, and 

ARY + (PTR<<2) + (FTN<<4) + DOUBLE

represents the type of an array of pointers to functions returning doubles.

The type words are ordinarily manipulated by macros. If t is a type word, BTYPE(t)
gives the basic type. ISPTR(t), ISARY(t), and ISFTN(t) ask if an object of this type is a
pointer, array, or a function, respectively. MODTYPE(t,b) sets the basic type of t to b.
DECREF(t) gives the type resulting from removing the first operator from t. Thus, if t is a
pointer to t’, a function returning t’, or an array of t’, then DECREF(t) would equal t’.
INCREF(t) gives the type representing a pointer to t. Finally, there are operators for dealing 
with the unsigned types. ISUNSIGNED(t) returns true if t is one of the four basic unsigned 
types; in this case, DEUNSIGN(t) gives the associated ‘signed’ type. Similarly, 
UNSIGNABLE(t) returns true if t is one of the four basic types that could become unsigned, 
and ENUNSIGN(t) returns the unsigned analogue of t in this case. 
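
The encoding and the macros can be made concrete with a short sketch. The macro names are
those used in the text; the numeric values of the basic types, the masks BTMASK and TMASK, the
shift TSHIFT, the typedef TWORD, and the helper type_depth are assumptions chosen for
illustration, and the macro bodies are one plausible way to write them rather than a copy of
the compiler's own.

```c
typedef unsigned int TWORD;

/* Basic types, held in the rightmost four bits (assumed values). */
#define CHAR      2
#define SHORT     3
#define INT       4
#define LONG      5
#define FLOAT     6
#define DOUBLE    7
#define UCHAR     12
#define USHORT    13
#define UNSIGNED  14
#define ULONG     15

#define BTMASK  017   /* the basic type field              */
#define TMASK   060   /* the first two-bit operator field  */
#define TSHIFT  2     /* width of one operator field       */

/* Operators, aligned with the first two-bit field. */
#define PTR  020
#define FTN  040
#define ARY  060

#define BTYPE(t)      ((t) & BTMASK)
#define ISPTR(t)      (((t) & TMASK) == PTR)
#define ISFTN(t)      (((t) & TMASK) == FTN)
#define ISARY(t)      (((t) & TMASK) == ARY)
#define MODTYPE(t,b)  ((t) = ((t) & ~BTMASK) | (b))
#define INCREF(t)     ((((t) & ~BTMASK) << TSHIFT) | PTR | ((t) & BTMASK))
#define DECREF(t)     ((((t) >> TSHIFT) & ~BTMASK) | ((t) & BTMASK))

#define ISUNSIGNED(t) ((t) >= UCHAR && (t) <= ULONG)
#define DEUNSIGN(t)   ((t) + (INT - UNSIGNED))
#define UNSIGNABLE(t) ((t) >= CHAR && (t) <= LONG)
#define ENUNSIGN(t)   ((t) + (UNSIGNED - INT))

/* Hypothetical helper: count the operators applied to the basic type,
 * reading the two-bit fields until an empty one is reached. */
int type_depth(TWORD t)
{
    int n = 0;
    while (t & TMASK) {
        t = DECREF(t);
        n++;
    }
    return n;
}
```

Applying DECREF three times to ARY + (PTR<<2) + (FTN<<4) + DOUBLE peels off the array, the
pointer, and the function in turn, leaving DOUBLE.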

The other important global data structure is that of expression trees. The actual shapes 
of the nodes are given in mfile1 and mfile2. They are not the same in the two passes; the first
pass nodes contain dimension and size information, while the second pass nodes contain
register allocation information. Nevertheless, all nodes contain fields called op, containing the
node number, and type, containing the type word. A function called talloc() returns a pointer to a
new tree node. To free a node, its op field need merely be set to FREE. The other fields in 
the node will remain intact at least until the next allocation. 

Nodes representing binary operators contain fields, left and right, that contain pointers to 
the left and right descendants. Unary operator nodes have the left field, and a value field 
called rval. Leaf nodes, with no descendants, have two value fields: lval and rval.

At appropriate times, the function tcheck() can be called, to check that there are no busy
nodes remaining. This is used as a compiler consistency check. The function tcopy(p) takes a
pointer p that points to an expression tree, and returns a pointer to a disjoint copy of the tree.
The function walkf(p,f) performs a postorder walk of the tree pointed to by p, and applies the
function f to each node. The function fwalk(p,f,d) does a preorder walk of the tree pointed to
by p. At each node, it calls a function f, passing to it the node pointer, a value passed down
from its ancestor, and two pointers to values to be passed down to the left and right descen¬ 
dants (if any). The value d is the value passed down to the root. Fwalk is used for a number 
of tree labeling and debugging activities. 
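
The allocation discipline and the postorder walk can be sketched as follows. This is a
simplified model, not the compiler's code: the node layout is reduced to four fields, the value
of FREE, the pool size TREESZ, and the helper free_node are assumptions, tcheck here returns a
count instead of reporting an error, and walkf finds descendants by testing for null pointers
rather than by consulting the node number.

```c
#include <stddef.h>

#define FREE    0     /* assumed node number marking an unused node */
#define TREESZ  32    /* assumed size of the node pool */

typedef struct node {
    int op;               /* node number */
    unsigned type;        /* type word */
    struct node *left;    /* left descendant, or NULL */
    struct node *right;   /* right descendant, or NULL */
} NODE;

static NODE pool[TREESZ];   /* zero-initialized, so every node starts FREE */

/* Return a pointer to a new tree node, or NULL if the pool is exhausted. */
NODE *talloc(void)
{
    for (int i = 0; i < TREESZ; i++) {
        if (pool[i].op == FREE) {
            pool[i].op = -1;    /* busy; the caller fills in the real op */
            pool[i].left = pool[i].right = NULL;
            return &pool[i];
        }
    }
    return NULL;
}

/* Consistency check: the number of nodes still busy (0 when all are free). */
int tcheck(void)
{
    int busy = 0;
    for (int i = 0; i < TREESZ; i++)
        if (pool[i].op != FREE)
            busy++;
    return busy;
}

/* Postorder walk: visit the descendants first, then the node itself. */
void walkf(NODE *p, void (*f)(NODE *))
{
    if (p == NULL)
        return;
    walkf(p->left, f);
    walkf(p->right, f);
    f(p);
}

/* Hypothetical helper: free one node by overwriting its op field only. */
void free_node(NODE *p)
{
    p->op = FREE;
}
```

Because freeing touches only the op field, walkf(p, free_node) can release a whole tree safely:
each node's descendant pointers are still intact when the walk reaches it.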

The other major data structure, the symbol table, exists only in pass one, and will be dis¬ 
cussed later. 


Pass One 

The first pass does lexical analysis, parsing, symbol table maintenance, tree building, 
optimization, and a number of machine dependent things. This pass is largely machine 
independent, and the machine independent sections can be pretty successfully ignored. Thus,
they will be only sketched here. 

Lexical Analysis 

The lexical analyzer is a conceptually simple routine that reads the input and returns the 
tokens of the C language as it encounters them: names, constants, operators, and keywords. 
The conceptual simplicity of this job is confounded a bit by several other simple jobs that
unfortunately must go on simultaneously. These include 

• Keeping track of the current filename and line number, and occasionally setting this 
information as the result of preprocessor control lines. 

• Skipping comments. 

• Properly dealing with octal, decimal, hex, floating point, and character constants, as well 
as character strings. 

To achieve speed, the program maintains several tables that are indexed into by character 
value, to tell the lexical analyzer what to do next. To achieve portability, these tables must be 
initialized each time the compiler is run, in order that the table entries reflect the local charac¬ 
ter set values. 

Parsing 

As mentioned above, the parser is generated by Yacc from the grammar on file cgram.y. 
The grammar is relatively readable, but contains some unusual features that are worth com¬ 
ment. 

Perhaps the strangest feature of the grammar is the treatment of declarations. The prob¬ 
lem is to keep track of the basic type and the storage class while interpreting the various stars, 
brackets, and parentheses that may surround a given name. The entire declaration mechanism 
must be recursive, since declarations may appear within declarations of structures and unions, 
or even within a sizeof construction inside a dimension in another declaration! 

There are some difficulties in using a bottom-up parser, such as produced by Yacc, to 
handle constructions where a lot of left context information must be kept around. The prob¬ 
lem is that the original PDP-11 compiler is top-down in implementation, and some of the 
semantics of C reflect this. In a top-down parser, the input rules are restricted somewhat, but 
one can naturally associate temporary storage with a rule at a very early stage in the recogni¬ 
tion of that rule. In a bottom-up parser, there is more freedom in the specification of rules, 
but it is more difficult to know what rule is being matched until the entire rule is seen. The 
parser described by cgram.c makes effective use of the bottom-up parsing mechanism in some 
places (notably the treatment of expressions), but struggles against the restrictions in others. 
The usual result is that it is necessary to run a stack of values “on the side”, independent of 
the Yacc value stack, in order to be able to store and access information deep within inner
constructions, where the relationship of the rules being recognized to the total picture is not yet
clear. 

In the case of declarations, the attribute information (type, etc.) for a declaration is care¬ 
fully kept immediately to the left of the declarator (that part of the declaration involving the 
name). In this way, when it is time to declare the name, the name and the type information 
can be quickly brought together. The “$0” mechanism of Yacc is used to accomplish this. The 
result is not pretty, but it works. The storage class information changes more slowly, so it is 
kept in an external variable, and stacked if necessary. Some of the grammar could be consider¬ 
ably cleaned up by using some more recent features of Yacc, notably actions within rules and 
the ability to return multiple values for actions. 

A stack is also used to keep track of the current location to be branched to when a break 
or continue statement is processed. 

This use of external stacks dates from the time when Yacc did not permit values to be 
structures. Some, or most, of this use of external stacks could be eliminated by redoing the 
grammar to use the mechanisms now provided. There are some areas, however, particularly 
the processing of structure, union, and enum declarations, function prologs, and switch
statement processing, where having all the affected data together in an array speeds later processing;
in this case, use of external storage seems essential. 

The cgram.y file also contains some small functions used as utility functions in the 
parser. These include routines for saving case values and labels in processing switches, and 
stacking and popping values on the external stack described above. 

Storage Classes 

C has a finite, but fairly extensive, number of storage classes available. One of the com¬ 
piler design decisions was to process the storage class information totally in the first pass; by 
the second pass, this information must have been totally dealt with. This means that all of the 
storage allocation must take place in the first pass, so that references to automatics and 
parameters can be turned into references to cells lying a certain number of bytes offset from 
certain machine registers. Much of this transformation is machine dependent, and strongly 
depends on the storage class. 

The classes include EXTERN (for externally declared, but not defined variables), 
EXTDEF (for external definitions), and similar distinctions for USTATIC and STATIC, 
UFORTRAN and FORTRAN (for fortran functions) and ULABEL and LABEL. The storage 
classes REGISTER and AUTO are obvious, as are STNAME, UNAME, and ENAME (for 
structure, union, and enumeration tags), and the associated MOS, MOU, and MOE (for the 
members). TYPEDEF is treated as a storage class as well. There are two special storage 
classes: PARAM and SNULL. SNULL is used to distinguish the case where no explicit 
storage class has been given; before an entry is made in the symbol table the true storage class 
is discovered. Similarly, PARAM is used for the temporary entry in the symbol table made 
before the declaration of function parameters is completed. 

Most of the complexity in the storage class process comes from bit fields. A separate
storage class is kept for each width bit field; a k-bit bit field has storage class k plus FIELD.
This enables the size to be quickly recovered from the storage class. 
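
The arithmetic is simple enough to show directly. In this sketch the values of FIELD and
FLDSIZ are assumptions, as are the three helper macros; it also assumes that the ordinary
storage classes are all numbered below FIELD, so that the flag bit is unambiguous.

```c
/* Assumed values: a flag bit sitting above a six-bit width field. */
#define FIELD   0100
#define FLDSIZ  077

/* Hypothetical helpers, not names from the compiler. */
#define field_class(k)  ((k) + FIELD)         /* storage class of a k-bit field */
#define is_field(c)     (((c) & FIELD) != 0)  /* is this class a bit field?     */
#define field_width(c)  ((c) & FLDSIZ)        /* recover k from the class       */
```

Under these assumptions a 5-bit field has storage class 0105, and masking with FLDSIZ gives
back the width 5.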

Symbol Table Maintenance. 

The symbol table routines do far more than simply enter names into the symbol table; 
considerable semantic processing and checking is done as well. For example, if a new declara¬ 
tion comes in, it must be checked to see if there is a previous declaration of the same symbol. 
If there is, there are many cases. The declarations may agree and be compatible (for example, 
an extern declaration can appear twice) in which case the new declaration is ignored. The new 
declaration may add information (such as an explicit array dimension) to an already present 
declaration. The new declaration may be different, but still correct (for example, an extern
declaration of something may be entered, and then later the definition may be seen). The new
declaration may be incompatible, but appear in an inner block; in this case, the old declaration 
is carefully hidden away, and the new one comes into force until the block is left. Finally, the 
declarations may be incompatible, and an error message must be produced. 

A number of other factors make for additional complexity. The type declared by the user 
is not always the type entered into the symbol table (for example, if a formal parameter to a
function is declared to be an array, C requires that this be changed into a pointer before entry 
in the symbol table). Moreover, there are various kinds of illegal types that may be declared 
which are difficult to check for syntactically (for example, a function returning an array). 
Finally, there is a strange feature in C that requires structure tag names and member names 
for structures and unions to be taken from a different logical symbol table than ordinary 
identifiers. Keeping track of which kind of name is involved is a bit of a struggle (consider
typedef names used within structure declarations, for example). 

The symbol table handling routines have been rewritten a number of times to extend 
features, improve performance, and fix bugs. They address the above problems with reasonable 
effectiveness but a singular lack of grace. 

When a name is read in the input, it is hashed, and the routine lookup is called, together 
with a flag which tells which symbol table should be searched (actually, both symbol tables are 
stored in one, and a flag is used to distinguish individual entries). If the name is found, lookup
returns the index to the entry found; otherwise, it makes a new entry, marks it UNDEF
(undefined), and returns the index of the new entry. This index is stored in the rval field of a
NAME node.

When a declaration is being parsed, this NAME node is made part of a tree with UNARY 
MUL nodes for each *, LB nodes for each array descriptor (the right descendant has the 
dimension), and UNARY CALL nodes for each function descriptor. This tree is passed to the 
routine tymerge, along with the attribute type of the whole declaration; this routine collapses 
the tree to a single node, by calling tyreduce, and then modifies the type to reflect the overall 
type of the declaration. 

Dimension and size information is stored in a table called dimtab. To properly describe a 
type in C, one needs not just the type information but also size information (for structures and 
enums) and dimension information (for arrays). Sizes and offsets are dealt with in the com¬ 
piler by giving the associated indices into dimtab. Tymerge and tyreduce call dstash to put the 
discovered dimensions away into the dimtab array. Tymerge returns a pointer to a single node 
that contains the symbol table index in its rval field, and the size and dimension indices in 
fields csiz and cdim, respectively. This information is properly considered part of the type in 
the first pass, and is carried around at all times. 

To enter an element into the symbol table, the routine defid is called; it is handed a 
storage class, and a pointer to the node produced by tymerge. Defid calls fixtype, which adjusts 
and checks the given type depending on the storage class, and converts null types appropri¬ 
ately. It then calls fixclass, which does a similar job for the storage class; it is here, for exam¬ 
ple, that register declarations are either allowed or changed to auto. 

The new declaration is now compared against an older one, if present, and several pages 
of validity checks performed. If the definitions are compatible, with possibly some added
information, the processing is straightforward. If the definitions differ, the block levels of the
current and the old declaration are compared. The current block level is kept in blevel, an 
external variable; the old declaration level is kept in the symbol table. Block level 0 is for 
external declarations, 1 is for arguments to functions, and 2 and above are blocks within a 
function. If the current block level is the same as the old declaration, an error results. If the 
current block level is higher, the new declaration overrides the old. This is done by marking 
the old symbol table entry “hidden”, and making a new entry, marked “hiding”. Lookup will 
skip over hidden entries. When a block is left, the symbol table is searched, and any entries 
defined in that block are destroyed; if they hid other entries, the old entries are “unhidden”. 
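
The hiding mechanism can be modeled in miniature. The sketch below is an assumption-laden
simplification: the table layout, the flag names SHIDDEN and SHIDES, and the functions
find_visible, declare, and leave_block are invented for illustration, the search is linear
rather than hashed, and every redeclaration at the same block level is treated as incompatible.
Only blevel is a name taken from the text.

```c
#include <stddef.h>
#include <string.h>

#define SYMTSZ   32
#define SHIDDEN  01   /* entry is hidden by an inner declaration */
#define SHIDES   02   /* entry hides an outer declaration */

struct symtab {
    const char *name;   /* NULL marks an unused slot */
    int level;          /* block level of the declaration */
    int flags;
};

static struct symtab stab[SYMTSZ];
int blevel = 0;         /* current block level */

/* Index of the visible entry for name, or -1; hidden entries are skipped. */
int find_visible(const char *name)
{
    for (int i = 0; i < SYMTSZ; i++)
        if (stab[i].name != NULL && !(stab[i].flags & SHIDDEN) &&
            strcmp(stab[i].name, name) == 0)
            return i;
    return -1;
}

/* Declare name at the current block level. Returns the new index,
 * or -1 on a same-level redeclaration or a full table. */
int declare(const char *name)
{
    int old = find_visible(name);
    int slot = -1;

    if (old >= 0 && stab[old].level == blevel)
        return -1;
    for (int i = 0; i < SYMTSZ && slot < 0; i++)
        if (stab[i].name == NULL)
            slot = i;
    if (slot < 0)
        return -1;
    stab[slot].name = name;
    stab[slot].level = blevel;
    stab[slot].flags = 0;
    if (old >= 0) {             /* override an outer declaration */
        stab[old].flags |= SHIDDEN;
        stab[slot].flags = SHIDES;
    }
    return slot;
}

/* Leave the current block: destroy its entries and unhide what they hid. */
void leave_block(void)
{
    for (int i = 0; i < SYMTSZ; i++) {
        if (stab[i].name == NULL || stab[i].level != blevel)
            continue;
        if (stab[i].flags & SHIDES) {
            int best = -1;      /* innermost hidden entry of the same name */
            for (int j = 0; j < SYMTSZ; j++)
                if (j != i && stab[j].name != NULL &&
                    (stab[j].flags & SHIDDEN) &&
                    strcmp(stab[j].name, stab[i].name) == 0 &&
                    (best < 0 || stab[j].level > stab[best].level))
                    best = j;
            if (best >= 0)
                stab[best].flags &= ~SHIDDEN;
        }
        stab[i].name = NULL;
    }
    blevel--;
}
```
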

This nice block structure is warped a bit because labels do not follow the block structure
rules (one can do a goto into a block, for example); default definitions of functions in inner 
blocks also persist clear out to the outermost scope. This implies that cleaning up the symbol 
table after block exit is more subtle than it might first seem. 

For successful new definitions, defid also initializes a “general purpose” field, offset, in the 
symbol table. It contains the stack offset for automatics and parameters, the register number 
for register variables, the bit offset into the structure for structure members, and the internal 
label number for static variables and labels. The offset field is set by falloc for bit fields, and 
dclstruct for structures and unions. 

The symbol table entry itself thus contains the name, type word, size and dimension 
offsets, offset value, and declaration block level. It also has a field of flags, describing what 
symbol table the name is in, and whether the entry is hidden, or hides another. Finally, a field 
gives the line number of the last use, or of the definition, of the name. This is used mainly for 
diagnostics, but is useful to lint as well. 

In some special cases, there is more than the above amount of information kept for the 
use of the compiler. This is especially true with structures; for use in initialization, structure 
declarations must have access to a list of the members of the structure. This list is also kept 
in dimtab. Because a structure can be mentioned long before the members are known, it is 
necessary to have another level of indirection in the table. The two words following the csiz 
entry in dimtab are used to hold the alignment of the structure, and the index in dimtab of the 
list of members. This list contains the symbol table indices for the structure members, ter¬ 
minated by a -1. 

Tree Building 

The portable compiler transforms expressions into expression trees. As the parser recog¬ 
nizes each rule making up an expression, it calls buildtree which is given an operator number, 
and pointers to the left and right descendants. Buildtree first examines the left and right des¬ 
cendants, and, if they are both constants, and the operator is appropriate, simply does the con¬ 
stant computation at compile time, and returns the result as a constant. Otherwise, buildtree 
allocates a node for the head of the tree, attaches the descendants to it, and ensures that 
conversion operators are generated if needed, and that the type of the new node is consistent 
with the types of the operands. There is also a considerable amount of semantic complexity 
here; many combinations of types are illegal, and the portable compiler makes a strong effort 
to check the legality of expression types completely. This is done both for lint purposes, and 
to prevent such semantic errors from being passed through to the code generator. 

The heart of buildtree is a large table, accessed by the routine opact. This routine maps 
the types of the left and right operands into a rather smaller set of descriptors, and then 
accesses a table (actually encoded in a switch statement) which for each operator and pair of 
types causes an action to be returned. The actions are logical or’s of a number of separate 
actions, which may be carried out by buildtree. These component actions may include check¬ 
ing the left side to ensure that it is an lvalue (can be stored into), applying a type conversion to 
the left or right operand, setting the type of the new node to the type of the left or right 
operand, calling various routines to balance the types of the left and right operands, and 
suppressing the ordinary conversion of arrays and function operands to pointers. An impor¬ 
tant operation is OTHER, which causes some special code to be invoked in buildtree, to handle 
issues which are unique to a particular operator. Examples of this are structure and union 
reference (actually handled by the routine stref), the building of NAME, ICON, STRING and 
FCON (floating point constant) nodes, unary * and &, structure assignment, and calls. In the 
case of unary * and &, buildtree will cancel a * applied to a tree, the top node of which is &, 
and conversely. 
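
Two of the behaviors described in this section, compile-time evaluation of constant operands
and the cancellation of * against &, can be sketched together. The sketch leaves out everything
else buildtree does (type checking, conversions, the opact table), ignores arithmetic overflow,
and uses hypothetical node numbers, a reduced node layout, and an invented helper mknode.

```c
#include <stdlib.h>

/* Hypothetical node numbers, for illustration only. */
enum { ICON = 1, NAME, PLUS, MINUS, MUL, UNARY_MUL, UNARY_AND };

typedef struct node {
    int op;
    long lval;                   /* constant value, for ICON nodes */
    struct node *left, *right;
} NODE;

/* Hypothetical helper: allocate and fill in one node. */
NODE *mknode(int op, NODE *l, NODE *r, long v)
{
    NODE *p = malloc(sizeof *p);
    if (p == NULL)
        abort();
    p->op = op;
    p->lval = v;
    p->left = l;
    p->right = r;
    return p;
}

NODE *buildtree(int op, NODE *l, NODE *r)
{
    /* Both operands constant and the operator suitable: fold now. */
    if (l != NULL && r != NULL && l->op == ICON && r->op == ICON) {
        int folded = 1;
        switch (op) {
        case PLUS:  l->lval += r->lval; break;
        case MINUS: l->lval -= r->lval; break;
        case MUL:   l->lval *= r->lval; break;
        default:    folded = 0; break;
        }
        if (folded) {
            free(r);
            return l;            /* the result is itself a constant */
        }
    }
    /* A * applied to a tree headed by & cancels, and conversely. */
    if (l != NULL &&
        ((op == UNARY_MUL && l->op == UNARY_AND) ||
         (op == UNARY_AND && l->op == UNARY_MUL))) {
        NODE *inner = l->left;
        free(l);
        return inner;
    }
    /* Otherwise allocate a head node and attach the descendants. */
    return mknode(op, l, r, 0);
}
```

Building PLUS over the constants 2 and 3 yields a single ICON node holding 5, and building
unary * over the address of a NAME node returns the NAME node itself.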

Another special operation is PUN; this causes the compiler to check for type mismatches, 
such as intermixing pointers and integers. 

The treatment of conversion operators is still a rather strange area of the compiler (and
of C!). The recent introduction of type casts has only confounded this situation. Most of the 
conversion operators are generated by calls to tymatch and ptmatch, both of which are given a 
tree, and asked to make the operands agree in type. Ptmatch treats the case where one of the 
operands is a pointer; tymatch treats all other cases. Where these routines have decided on the 
proper type for an operand, they call makety, which is handed a tree, and a type word, dimen¬ 
sion offset, and size offset. If necessary, it inserts a conversion operation to make the types 
correct. Conversion operations are never inserted on the left side of assignment operators, 
however. There are two conversion operators used; PCONV, if the conversion is to a non-basic 
type (usually a pointer), and SCONV, if the conversion is to a basic type (scalar). 

To allow for maximum flexibility, every node produced by buildtree is given to a machine 
dependent routine, clocal, immediately after it is produced. This is to allow more or less 
immediate rewriting of those nodes which must be adapted for the local machine. The conver¬ 
sion operations are given to clocal as well; on most machines, many of these conversions do 
nothing, and should be thrown away (being careful to retain the type). If this operation is 
done too early, however, later calls to buildtree may get confused about the correct type of the
subtrees; thus clocal is given the conversion ops only after the entire tree is built. This topic will
be dealt with in more detail later. 

Initialization 

Initialization is one of the messier areas in the portable compiler. The only consolation 
is that most of the mess takes place in the machine independent part, where it may be safely
ignored by the implementor of the compiler for a particular machine. 

The basic problem is that the semantics of initialization really calls for a co-routine struc¬ 
ture; one collection of programs reading constants from the input stream, while another, 
independent set of programs places these constants into the appropriate spots in memory. The 
dramatic differences in the local assemblers also come to the fore here. The parsing problems 
are dealt with by keeping a rather extensive stack containing the current state of the initializa¬ 
tion; the assembler problems are dealt with by having a fair number of machine dependent rou¬ 
tines. 

The stack contains the symbol table number, type, dimension index, and size index for 
the current identifier being initialized. Another entry has the offset, in bits, of the beginning of 
the current identifier. Another entry keeps track of how many elements have been seen, if the 
current identifier is an array. Still another entry keeps track of the current member of a
structure being initialized. Finally, there is an entry containing flags which keep track of the
current state of the initialization process (e.g., tell if a } has been seen for the current
identifier).

When an initialization begins, the routine beginit is called; it handles the alignment res¬ 
trictions, if any, and calls instk to create the stack entry. This is done by first making an entry 
on the top of the stack for the item being initialized. If the top entry is an array, another 
entry is made on the stack for the first element. If the top entry is a structure, another entry 
is made on the stack for the first member of the structure. This continues until the top ele¬ 
ment of the stack is a scalar. Instk then returns, and the parser begins collecting initializers. 
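
The descent performed by instk can be sketched with a toy type description. Everything here
apart from the name instk is an assumption: the real routine works from symbol table, type word,
and dimtab indices rather than from a linked description, and its stack entries carry more
fields than the two shown.

```c
#include <stddef.h>

enum kind { SCALAR, ARRAY, STRUCT };

/* Hypothetical type description: for an array, first is the element
 * type; for a structure, the type of its first member. */
struct type {
    enum kind kind;
    const struct type *first;
};

struct instack {
    const struct type *ty;   /* what is being initialized at this level */
    int count;               /* elements or members seen so far */
};

#define STACKSZ 16
static struct instack stack[STACKSZ];
static int top = -1;         /* index of the top entry; -1 when empty */

/* Push entries until the top of the stack is a scalar. Returns the
 * resulting stack depth, or -1 on overflow or a malformed type. */
int instk(const struct type *t)
{
    while (t != NULL && top + 1 < STACKSZ) {
        top++;
        stack[top].ty = t;
        stack[top].count = 0;
        if (t->kind == SCALAR)
            return top + 1;
        t = t->first;        /* first element or first member */
    }
    return -1;
}
```

For an array of structures whose first member is a scalar, this leaves three entries on the
stack: the array, the structure, and the scalar on top.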

When a constant is obtained, the routine doinit is called; it examines the stack, and does
whatever is necessary to assign the current constant to the scalar on the top of the stack.
Gotscal is then called, which rearranges the stack so that the next scalar to be initialized gets
placed on top of the stack. This process continues until the end of the initializers; endinit
cleans up. If a { or } is encountered in the string of initializers, it is handled by calling ilbrace
or irbrace, respectively.

A central issue is the treatment of the “holes” that arise as a result of alignment restric¬ 
tions or explicit requests for holes in bit fields. There is a global variable, inoff, which contains 
the current offset in the initialization (all offsets in the first pass of the compiler are in bits). 
Doinit figures out from the top entry on the stack the expected bit offset of the next identifier; 
it calls the machine dependent routine inforce which, in a machine dependent way, forces the
assembler to set aside space if need be so that the next scalar seen will go into the appropriate 
bit offset position. The scalar itself is passed to one of the machine dependent routines fincode 
(for floating point initialization), incode (for fields, and other initializations less than an int in 
size), and cinit (for all other initializations). The size is passed to all these routines, and it is 
up to the machine dependent routines to ensure that the initializer occupies exactly the right 
size. 

Character strings represent a bit of an exception. If a character string is seen as the ini¬ 
tializer for a pointer, the characters making up the string must be put out under a different 
location counter. When the lexical analyzer sees the quote at the head of a character string, it 
returns the token STRING, but does not do anything with the contents. The parser calls 
getstr, which sets up the appropriate location counters and flags, and calls lxstr to read and
process the contents of the string. 

If the string is being used to initialize a character array, lxstr calls putbyte, which in
effect simulates doinit for each character read. If the string is used to initialize a character
pointer, lxstr calls a machine dependent routine, bycode, which stashes away each character.
The pointer to this string is then returned, and processed normally by doinit.

The null at the end of the string is treated as if it were read explicitly by lxstr.

Statements 

The first pass addresses four main areas: declarations, expressions, initialization, and
statements. The statement processing is relatively simple; most of it is carried out in the 
parser directly. Most of the logic is concerned with allocating label numbers, defining the 
labels, and branching appropriately. An external symbol, reached, is 1 if a statement can be 
reached, 0 otherwise; this is used to do a bit of simple flow analysis as the program is being 
parsed, and also to avoid generating the subroutine return sequence if the subroutine cannot 
“fall through” the last statement. 

Conditional branches are handled by generating an expression node, CBRANCH, whose 
left descendant is the conditional expression and the right descendant is an ICON node con¬ 
taining the internal label number to be branched to. For efficiency, the semantics are that the 
label is gone to if the condition is false. 

The switch statement is compiled by collecting the case entries, and an indication as to 
whether there is a default case; an internal label number is generated for each of these, and 
remembered in a big array. The expression comprising the value to be switched on is compiled 
when the switch keyword is encountered, but the expression tree is headed by a special node, 
FORCE, which tells the code generator to put the expression value into a special distinguished 
register (this same mechanism is used for processing the return statement). When the end of 
the switch block is reached, the array containing the case values is sorted, and checked for 
duplicate entries (an error); if all is correct, the machine dependent routine genswitch is called, 
with this array of labels and values in increasing order. Genswitch can assume that the value 
to be tested is already in the register which is the usual integer return value register. 
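
The sorting and duplicate check done at the end of a switch block might look roughly like the sketch below. The structure swent, its field names, and the function sort_cases are assumptions made for illustration; they are not taken from the compiler source.

```c
#include <stdlib.h>

/* Hypothetical record for one case entry: the case constant and the
   internal label number generated for it. */
struct swent {
    long sval;   /* case value */
    int  slab;   /* internal label number */
};

/* Comparison function for qsort: order entries by case value. */
static int swcmp(const void *a, const void *b)
{
    const struct swent *x = a;
    const struct swent *y = b;
    return (x->sval > y->sval) - (x->sval < y->sval);
}

/* Sort the entries into increasing order of value.  Return the index of
   the first duplicate found, or -1 if all case values are distinct. */
int sort_cases(struct swent *tab, int n)
{
    int i;

    qsort(tab, (size_t)n, sizeof tab[0], swcmp);
    for (i = 1; i < n; i++)
        if (tab[i].sval == tab[i - 1].sval)
            return i;
    return -1;
}
```

Once the table is known to be free of duplicates, it is in the increasing order that a routine such as genswitch expects.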

Optimization 

There is a machine independent file, optim.c, which contains a relatively short optimization 
routine, optim. Actually the word optimization is something of a misnomer; the results 
are not optimum, only improved, and the routine is in fact not optional; it must be called for 
proper operation of the compiler. 

Optim is called after an expression tree is built, but before the code generator is called. 
The essential part of its job is to call clocal on the conversion operators. On most machines, 
the treatment of & is also essential: by this time in the processing, the only node which is a 
legal descendant of & is NAME. (Possible descendants of * have been eliminated by buildtree.) 
The address of a static name is, almost by definition, a constant, and can be represented by an 
ICON node on most machines (provided that the loader has enough power). Unfortunately, 
this is not universally true; on some machines, such as the IBM 370, the issue of addressability 
rears its ugly head; thus, before turning a NAME node into an ICON node, the machine dependent 
function andable is called. 

The optimization attempts of optim are currently quite limited. It is primarily concerned 
with improving the behavior of the compiler with operations one of whose arguments is a con¬ 
stant. In the simplest case, the constant is placed on the right if the operation is commutative. 
The compiler also makes a limited search for expressions such as 

( x + a ) + b 

where a and b are constants, and attempts to combine a and b at compile time. A number of 
special cases are also examined; additions of 0 and multiplications by 1 are removed, although 
the correct processing of these cases to get the type of the resulting tree correct is decidedly 
nontrivial. In some cases, the addition or multiplication must be replaced by a conversion op 
to keep the types from becoming fouled up. In cases where a relational operation is 
being done, and one operand is a constant, the operands are permuted, and the operator 
altered, if necessary, to put the constant on the right. Finally, multiplications by a power of 2 
are changed to shifts. 
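
Three of these rewrites (constant to the right, combining constants, and multiplication to shift) can be sketched on a toy tree as follows. This is not the code of optim: the node layout, the operator names, and the function optim_sketch are assumptions, and the sketch ignores types entirely, which is exactly the part the text calls nontrivial. For that reason it does not attempt to remove additions of 0 or multiplications by 1.

```c
#include <stddef.h>

enum op { NAME, ICON, PLUS, MUL, LS };   /* LS: left shift */

struct node {
    enum op op;
    struct node *left, *right;   /* NULL for leaves */
    long lval;                   /* value of an ICON leaf */
};

/* If v is a positive power of two return its log2, otherwise -1. */
static int ispow2(long v)
{
    int n = 0;

    if (v <= 0 || (v & (v - 1)) != 0)
        return -1;
    while ((v >>= 1) != 0)
        n++;
    return n;
}

/* Rewrite the tree in place and return its (possibly new) root.
   Nodes dropped by a rewrite are simply abandoned in this sketch. */
struct node *optim_sketch(struct node *p)
{
    struct node *t;
    int sh;

    if (p->op != PLUS && p->op != MUL)
        return p;
    p->left = optim_sketch(p->left);
    p->right = optim_sketch(p->right);

    /* Commutative operator: move a constant to the right. */
    if (p->left->op == ICON && p->right->op != ICON) {
        t = p->left;
        p->left = p->right;
        p->right = t;
    }
    /* (x + a) + b  becomes  x + (a + b), computed at compile time. */
    if (p->op == PLUS && p->right->op == ICON &&
        p->left->op == PLUS && p->left->right->op == ICON) {
        p->left->right->lval += p->right->lval;
        return p->left;
    }
    /* x * 2^k  becomes  x << k  (k > 0, so x * 1 is left alone). */
    if (p->op == MUL && p->right->op == ICON &&
        (sh = ispow2(p->right->lval)) > 0) {
        p->op = LS;
        p->right->lval = sh;
    }
    return p;
}
```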

There are dozens of similar optimizations that can be, and should be, done. It seems 
likely that this routine will be expanded in the relatively near future. 

Machine Dependent Stuff 

A number of the first pass machine dependent routines have been discussed above. In 
general, the routines are short, and easy to adapt from machine to machine. The two excep¬ 
tions to this general rule are clocal and the function prolog and epilog generation routines, 
bfcode and efcode. 

Clocal has the job of rewriting, if appropriate and desirable, the nodes constructed by 
buildtree. There are two major areas where this is important; NAME nodes and conversion 
operations. In the case of NAME nodes, clocal must rewrite the NAME node to reflect the 
actual physical location of the name in the machine. In effect, the NAME node must be exam¬ 
ined, the symbol table entry found (through the rval field of the node), and, based on the 
storage class of the node, the tree must be rewritten. Automatic variables and parameters are 
typically rewritten by treating the reference to the variable as a structure reference, off the 
register which holds the stack or argument pointer; the stref routine is set up to be called in 
this way, and to build the appropriate tree. In the most general case, the tree consists of a 
unary * node, whose descendant is a + node, with the stack or argument register as left 
operand, and a constant offset as right operand. In the case of LABEL and internal static 
nodes, the rval field is rewritten to be the negative of the internal label number; a negative rval 
field is taken to be an internal label number. Finally, a name of class REGISTER must be 
converted into a REG node, and the rval field replaced by the register number. In fact, this 
part of the clocal routine is nearly machine independent; only for machines with addressability 
problems (IBM 370 again!) does it have to be noticeably different. 

The conversion operator treatment is rather tricky. It is necessary to handle the applica¬ 
tion of conversion operators to constants in clocal, in order that all constant expressions can 
have their values known at compile time. In extreme cases, this may mean that some simula¬ 
tion of the arithmetic of the target machine might have to be done in a cross-compiler. In the 
most common case, conversions from pointer to pointer do nothing. For some machines, how¬ 
ever, conversion from byte pointer to short or long pointer might require a shift or rotate 
operation, which would have to be generated here. 

The extension of the portable compiler to machines where the size of a pointer depends 
on its type would be straightforward, but has not yet been done. 

The other major machine dependent issue involves the subroutine prolog and epilog generation. 
The hard part here is the design of the stack frame and calling sequence; this design 
issue is discussed elsewhere.⁵ The routine bfcode is called with the number of arguments the 
function is defined with, and an array containing the symbol table indices of the declared 
parameters. Bfcode must generate the code to establish the new stack frame, save the return 
address and previous stack pointer value on the stack, and save whatever registers are to be 
used for register variables. The stack size and the number of register variables are not known 
when bfcode is called, so these numbers must be referred to by assembler constants, which are 
defined when they are known (usually in the second pass, after all register variables, automat¬ 
ics, and temporaries have been seen). The final job is to find those parameters which may 
have been declared register, and generate the code to initialize the register with the value 
passed on the stack. Once again, for most machines, the general logic of bfcode remains the 
same, but the contents of the printf calls in it will change from machine to machine. Efcode is 
rather simpler, having just to generate the default return at the end of a function. This may be 
nontrivial in the case of a function returning a structure or union, however. 

There seems to be no really good place to discuss structures and unions, but this is as 
good a place as any. The C language now supports structure assignment, and the passing of 
structures as arguments to functions, and the receiving of structures back from functions. This 
was added rather late to C, and thus to the portable compiler. Consequently, it fits in less well 
than the older features. Moreover, most of the burden of making these features work is placed 
on the machine dependent code. 

There are both conceptual and practical problems. Conceptually, the compiler is struc¬ 
tured around the idea that to compute something, you put it into a register and work on it. 
This notion causes a bit of trouble on some machines (e.g., machines with 3-address opcodes), 
but matches many machines quite well. Unfortunately, this notion breaks down with struc¬ 
tures. The closest that one can come is to keep the addresses of the structures in registers. 
The actual code sequences used to move structures vary from the trivial (a multiple byte move) 
to the horrible (a function call), and are very machine dependent. 

The practical problem is more painful. When a function returning a structure is called, 
this function has to have some place to put the structure value. If it places it on the stack, it 
has difficulty popping its stack frame. If it places the value in a static temporary, the routine 
fails to be reentrant. The most logically consistent way of implementing this is for the caller 
to pass in a pointer to a spot where the called function should put the value before returning. 
This is relatively straightforward, although a bit tedious, to implement, but means that the 
caller must have properly declared the function type, even if the value is never used. On some 
machines, such as the Interdata 8/32, the return value simply overlays the argument region 
(which on the 8/32 is part of the caller’s stack frame). The caller takes care of leaving enough 
room if the returned value is larger than the arguments. This also assumes that the caller 
knows and declares the function properly. 

The PDP-11 and the VAX have stack hardware which is used in function calls and 
returns; this makes it very inconvenient to use either of the above mechanisms. In these 
machines, a static area within the called function is allocated, and the function return value is 
copied into it on return; the function returns the address of that region. This is simple to 
implement, but is non-reentrant. However, the function can now be called as a subroutine 
without being properly declared, without the disaster which would otherwise ensue. No matter 
what choice is taken, the convention is that the function actually returns the address of the 
return structure value. 
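
Written out by hand in C, two of the conventions discussed above look roughly as follows. The structure pair and both function names are invented for the illustration; the compiler generates the equivalent code itself rather than requiring the programmer to write it.

```c
struct pair { int a, b; };

/* Caller-supplied space: the caller passes a pointer to the spot where
   the result should be placed.  This form is reentrant. */
void make_pair_into(struct pair *result, int a, int b)
{
    result->a = a;
    result->b = b;
}

/* Static area in the called function (the PDP-11 and VAX approach
   described above): the value is copied into a static region and the
   address of that region is returned.  Simple, but not reentrant:
   every call reuses the same area. */
struct pair *make_pair_static(int a, int b)
{
    static struct pair area;

    area.a = a;
    area.b = b;
    return &area;
}
```

Two successive calls of make_pair_static return the same address, so a value that has not been copied out before the second call is overwritten; this is the reentrancy problem in miniature.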

In building expression trees, the portable compiler takes a bit for granted about struc¬ 
tures. It assumes that functions returning structures actually return a pointer to the structure, 
and it assumes that a reference to a structure is actually a reference to its address. The struc¬ 
ture assignment operator is rebuilt so that the left operand is the structure being assigned to, 
but the right operand is the address of the structure being assigned; this makes it easier to deal 
with 

a = b = c 

and similar constructions. 

There are four special tree nodes associated with these operations: STASG (structure 
assignment), STARG (structure argument to a function call), and STCALL and UNARY 
STCALL (calls of a function with nonzero and zero arguments, respectively). These four 
nodes are unique in that the size and alignment information, which can be determined by the 
type for all other objects in C, must be known to carry out these operations; special fields are 
set aside in these nodes to contain this information, and special intermediate code is used to 
transmit this information. 

First Pass Summary 

There are many other issues which have been ignored here, partly to justify the title 
“tour”, and partially because they have seemed to cause little trouble. There are some debug¬ 
ging flags which may be turned on, by giving the compiler’s first pass the argument 

-X[flags] 

Some of the more interesting flags are -Xd for the defining and freeing of symbols, -Xi for ini¬ 
tialization comments, and -Xb for various comments about the building of trees. In many 
cases, repeating the flag more than once gives more information; thus, -Xddd gives more infor¬ 
mation than -Xd. In the two pass version of the compiler, the flags should not be set when 
the output is sent to the second pass, since the debugging output and the intermediate code 
both go onto the standard output. 

We turn now to consideration of the second pass. 

Pass Two 

Code generation is far less well understood than parsing or lexical analysis, and for this 
reason the second pass is far harder to discuss in a file by file manner. A great deal of the 
difficulty is in understanding the issues and the strategies employed to meet them. Any partic¬ 
ular function is likely to be reasonably straightforward. 

Thus, this part of the paper will concentrate a good deal on the broader aspects of stra¬ 
tegy in the code generator, and will not get too intimate with the details. 

Overview. 

It is difficult to organize a code generator to be flexible enough to generate code for a 
large number of machines, and still be efficient for any one of them. Flexibility is also impor¬ 
tant when it comes time to tune the code generator to improve the output code quality. On 
the other hand, too much flexibility can lead to semantically incorrect code, and potentially a 
combinatorial explosion in the number of cases to be considered in the compiler. 

One goal of the code generator is to have a high degree of correctness. It is very desirable 
to have the compiler detect its own inability to generate correct code, rather than to produce 
incorrect code. This goal is achieved by having a simple model of the job to be done (e.g., an 
expression tree) and a simple model of the machine state (e.g., which registers are free). The 
act of generating an instruction performs a transformation on the tree and the machine state; 
hopefully, the tree eventually gets reduced to a single node. If each of these 
instruction/transformation pairs is correct, and if the machine state model really represents the 
actual machine, and if the transformations reduce the input tree to the desired single node, 
then the output code will be correct. 

For most real machines, there is no definitive theory of code generation that encompasses 
all the C operators. Thus the selection of which instruction/transformations to generate, and 
in what order, will have a heuristic flavor. If, for some expression tree, no transformation 
applies, or, more seriously, if the heuristics select a sequence of instruction/transformations 
that do not in fact reduce the tree, the compiler will report its inability to generate code, and 
abort. 

A major part of the code generator is concerned with the model and the transformations; 
most of this is machine independent, or depends only on simple tables. The flexibility 
comes from the heuristics that guide the transformations of the trees, the selection of subgoals, 
and the ordering of the computation. 

The Machine Model 

The machine is assumed to have a number of registers, of at most two different types: A 
and B. Within each register class, there may be scratch (temporary) registers and dedicated 
registers (e.g., register variables, the stack pointer, etc.). Requests to allocate and free registers 
involve only the temporary registers. 

Each of the registers in the machine is given a name and a number in the mac2defs file; 
the numbers are used as indices into various tables that describe the registers, so they should 
be kept small. One such table is the rstatus table on file local2.c. This table is indexed by 
register number, and contains expressions made up from manifest constants describing the 
register types: SAREG for dedicated AREG’s, SAREG|STAREG for scratch AREG’s, and 
SBREG and SBREG|STBREG similarly for BREG’s. There are macros that access this information: 
isbreg(r) returns true if register number r is a BREG, and istreg(r) returns true if register 
number r is a temporary AREG or BREG. Another table, rnames, contains the register 
names; this is used when putting out assembler code and diagnostics. 
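
A register description for an imaginary machine with three A registers and two B registers might be laid out as in the sketch below. The bit values, the register names, and the table contents are assumptions chosen for illustration, and the tables carry a _sketch suffix to make clear that they are not the compiler's own rstatus and rnames.

```c
/* Assumed bit values; the real ones are defined in the compiler's headers. */
#define SAREG   01      /* an A register */
#define STAREG  02      /* a scratch (temporary) A register */
#define SBREG   04      /* a B register */
#define STBREG  010     /* a scratch (temporary) B register */

/* Indexed by register number. */
const int rstatus_sketch[] = {
    SAREG | STAREG,     /* 0: r0, scratch A register */
    SAREG | STAREG,     /* 1: r1, scratch A register */
    SAREG,              /* 2: r2, dedicated (register variable) */
    SBREG | STBREG,     /* 3: b0, scratch B register */
    SBREG,              /* 4: b1, dedicated (e.g. stack pointer) */
};

/* Register names, used for assembler output and diagnostics. */
const char *const rnames_sketch[] = { "r0", "r1", "r2", "b0", "b1" };

#define isbreg(r) ((rstatus_sketch[r] & SBREG) != 0)
#define istreg(r) ((rstatus_sketch[r] & (STAREG | STBREG)) != 0)
```

Keeping the register numbers small keeps tables such as these small as well.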

The usage of registers is kept track of by an array called busy. Busy[r] is the number of 
uses of register r in the current tree being processed. The allocation and freeing of registers 
will be discussed later as part of the code generation algorithm. 

General Organization 

As mentioned above, the second pass reads lines from the intermediate file, copying 
through to the output unchanged any lines that begin with a ‘)’, and making note of the information 
about stack usage and register allocation contained on lines beginning with ‘]’ and ‘[’. 
The expression trees, whose beginning is indicated by a line beginning with ‘.’, are read and 
rebuilt into trees. If the compiler is loaded as one pass, the expression trees are immediately 
available to the code generator. 

The actual code generation is done by a hierarchy of routines. The routine delay is first 
given the tree; it attempts to delay some postfix ++ and -- computations that might reasonably 
be done after the smoke clears. It also attempts to handle comma (,) operators by computing 
the left side expression first, and then rewriting the tree to eliminate the operator. 
Delay calls codgen to control the actual code generation process. Codgen takes as arguments a 
pointer to the expression tree, and a second argument that, for socio-historical reasons, is 
called a cookie. The cookie describes a set of goals that would be acceptable for the code gen¬ 
eration: these are assigned to individual bits, so they may be logically or’ed together to form a 
large number of possible goals. Among the possible goals are FOREFF (compute for side 
effects only; don’t worry about the value), INTEMP (compute and store value into a temporary 
location in memory), INAREG (compute into an A register), INTAREG (compute into a 
scratch A register), INBREG and INTBREG similarly, FORCC (compute for condition codes), 
and FORARG (compute it as a function argument; e.g., stack it if appropriate). 
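
Because each goal occupies its own bit, a cookie is just an integer mask. The sketch below shows the idea; the particular bit values and the helper goal_ok are assumptions for illustration and need not agree with the compiler's headers.

```c
/* Assumed bit assignments for the goals named in the text. */
#define FOREFF   01     /* compute for side effects only */
#define INAREG   02     /* compute into an A register */
#define INTAREG  04     /* compute into a scratch A register */
#define INBREG   010    /* compute into a B register */
#define INTBREG  020    /* compute into a scratch B register */
#define FORCC    040    /* compute for condition codes */
#define INTEMP   0100   /* compute and store into a temporary */
#define FORARG   0200   /* compute as a function argument */

/* Return nonzero if a result of kind `achieved` (a single goal bit)
   is one of the goals the cookie will accept. */
int goal_ok(int cookie, int achieved)
{
    return (cookie & achieved) != 0;
}
```

A caller willing to accept either a value in an A register or just the condition codes would pass INAREG|FORCC.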

Codgen first canonicalizes the tree by calling canon. This routine looks for certain 
transformations that might now be applicable to the tree. One, which is very common and 
very powerful, is to fold together an indirection operator (UNARY MUL) and a register 
(REG); in most machines, this combination is addressable directly, and so is similar to a 
NAME in its behavior. The UNARY MUL and REG are folded together to make another 
node type called OREG. In fact, in many machines it is possible to directly address not just 
the cell pointed to by a register, but also cells differing by a constant offset from the cell 
pointed to by the register. Canon also looks for such cases, calling the machine dependent rou¬ 
tine notoff to decide if the offset is acceptable (for example, in the IBM 370 the offset must be 
between 0 and 4095 bytes). Another optimization is to replace bit field operations by shifts 
and masks if the operation involves extracting the field. Finally, a machine dependent routine, 
sucomp, is called that computes the Sethi-Ullman numbers for the tree (see below). 
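
The folding of an indirection through a register, with or without a constant offset, into an OREG node can be sketched as follows. The node layout and the function oreg_fold are assumptions, and notoff_sketch is a simplified stand-in for the machine dependent routine notoff (the real routine's arguments are not shown in this paper); it uses the IBM 370 range quoted above.

```c
#include <stddef.h>

enum { REG, ICON, PLUS, UNARY_MUL, OREG };

struct node {
    int op;
    struct node *left, *right;
    int rval;    /* register number, for REG and OREG */
    long lval;   /* constant, for ICON; offset, for OREG */
};

/* Stand-in for notoff: nonzero means the offset is NOT acceptable.
   Here offsets of 0..4095 bytes are allowed. */
static int notoff_sketch(long off)
{
    return off < 0 || off > 4095;
}

/* Try to fold *reg or *(reg + const) into a single OREG node, in place.
   Return 1 if the node was rewritten, 0 otherwise. */
int oreg_fold(struct node *p)
{
    struct node *q;

    if (p->op != UNARY_MUL)
        return 0;
    q = p->left;
    if (q->op == REG) {                       /* *reg */
        p->op = OREG;
        p->rval = q->rval;
        p->lval = 0;
        p->left = NULL;
        return 1;
    }
    if (q->op == PLUS && q->left->op == REG && q->right->op == ICON &&
        !notoff_sketch(q->right->lval)) {     /* *(reg + const) */
        p->op = OREG;
        p->rval = q->left->rval;
        p->lval = q->right->lval;
        p->left = NULL;
        return 1;
    }
    return 0;
}
```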

After the tree is canonicalized, codgen calls the routine store whose job is to select a sub¬ 
tree of the tree to be computed and (usually) stored before beginning the computation of the 
full tree. Store must return a tree that can be computed without need for any temporary 
storage locations. In effect, the only store operations generated while processing the subtree 
must be as a response to explicit assignment operators in the tree. This division of the job 
marks one of the more significant, and successful, departures from most other compilers. It 
means that the code generator can operate under the assumption that there are enough regis¬ 
ters to do its job, without worrying about temporary storage. If a store into a temporary 
appears in the output, it is always as a direct result of logic in the store routine; this makes 
debugging easier. 

One consequence of this organization is that code is not generated by a treewalk. There 
are theoretical results that support this decision.⁷ It may be desirable to compute several subtrees 
and store them before tackling the whole tree; if a subtree is to be stored, this is known 
before the code generation for the subtree is begun, and the subtree is computed when all 
scratch registers are available. 

The store routine decides what subtrees, if any, should be stored by making use of 
numbers, called Sethi-Ullman numbers, that give, for each subtree of an expression tree, the 
minimum number of scratch registers required to compile the subtree, without any stores into 
temporaries.⁸ These numbers are computed by the machine-dependent routine sucomp, called 
by canon. The basic notion is that, knowing the Sethi-Ullman numbers for the descendants of 
a node, and knowing the operator of the node and some information about the machine, the 
Sethi-Ullman number of the node itself can be computed. If the Sethi-Ullman number for a 
tree exceeds the number of scratch registers available, some subtree must be stored. Unfor¬ 
tunately, the theory behind the Sethi-Ullman numbers applies only to uselessly simple 
machines and operators. For the rich set of C operators, and for machines with asymmetric 
registers, register pairs, different kinds of registers, and exceptional forms of addressing, the 
theory cannot be applied directly. The basic idea of estimation is a good one, however, and 
well worth applying; the application, especially when the compiler comes to be tuned for high 
code quality, goes beyond the park of theory into the swamp of heuristics. This topic will be 
taken up again later, when more of the compiler structure has been described. 
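
For the "uselessly simple" machine to which the theory does apply, the numbers are easy to compute. The sketch below uses the textbook rule for a machine in which every leaf must be loaded into a register; it is not the compiler's sucomp, and the node layout and function names are assumptions.

```c
#include <stddef.h>

struct node {
    struct node *left, *right;   /* both NULL for a leaf */
    int su;                      /* Sethi-Ullman number, filled in below */
};

/* Textbook rule: a leaf needs one register; a unary node needs what its
   operand needs; a binary node needs one more than its operands if they
   need equally many, otherwise the larger of the two requirements. */
int sucomp_sketch(struct node *p)
{
    int l, r;

    if (p->left == NULL && p->right == NULL)
        return p->su = 1;
    if (p->right == NULL)
        return p->su = sucomp_sketch(p->left);
    l = sucomp_sketch(p->left);
    r = sucomp_sketch(p->right);
    return p->su = (l == r) ? l + 1 : (l > r ? l : r);
}

/* Nonzero if some subtree must be stored, given nscratch scratch registers. */
int must_store(struct node *p, int nscratch)
{
    return sucomp_sketch(p) > nscratch;
}
```

For the tree (a + b) * (c + d) each sum needs two registers and the product needs three, so on a machine with two scratch registers one of the sums must be stored first.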

After examining the Sethi-Ullman numbers, store selects a subtree, if any, to be stored, 
and returns the subtree and the associated cookie in the external variables stotree and stocook. 
If a subtree has been selected, or if the whole tree is ready to be processed, the routine order is 
called, with a tree and cookie. Order generates code for trees that do not require temporary 
locations. Order may make recursive calls on itself, and, in some cases, on codgen; for example, 
when processing the operators &&, ||, and comma (‘,’), that have a left to right evaluation, it is 
incorrect for store to examine the right operand for subtrees to be stored. In these cases, order 
will call codgen recursively when it is permissible to work on the right operand. A similar issue 
arises with the ? : operator. 

The order routine works by matching the current tree with a set of code templates. If a 
template is discovered that will match the current tree and cookie, the associated assembly 
language statement or statements are generated. The tree is then rewritten, as specified by the 
template, to represent the effect of the output instruction(s). If no template match is found, 
first an attempt is made to find a match with a different cookie; for example, in order to com¬ 
pute an expression with cookie INTEMP (store into a temporary storage location), it is usually 
necessary to compute the expression into a scratch register first. If all attempts to match the 
tree fail, the heuristic part of the algorithm becomes dominant. Control is typically given to 
one of a number of machine-dependent routines that may in turn recursively call order to 
achieve a subgoal of the computation (for example, one of the arguments may be computed 
into a temporary register). After this subgoal has been achieved, the process begins again with 
the modified tree. If the machine-dependent heuristics are unable to reduce the tree further, a 
number of default rewriting rules may be considered appropriate. For example, if the left 
operand of a + is a scratch register, the + can be replaced by a += operator; the tree may 
then match a template. 

To close this introduction, we will discuss the steps in compiling code for the expression 

a += b 

where a and b are static variables. 

To begin with, the whole expression tree is examined with cookie FOREFF, and no 
match is found. Search with other cookies is equally fruitless, so an attempt at rewriting is 
made. Suppose we are dealing with the Interdata 8/32 for the moment. It is recognized that 
the left hand and right hand sides of the += operator are addressable, and in particular the 
left hand side has no side effects, so it is permissible to rewrite this as 

a = a + b 

and this is done. No match is found on this tree either, so a machine dependent rewrite is 
done; it is recognized that the left hand side of the assignment is addressable, but the right 
hand side is not in a register, so order is called recursively, being asked to put the right hand 
side of the assignment into a register. This invocation of order searches the tree for a match, 
and fails. The machine dependent rule for + notices that the right hand operand is address¬ 
able; it decides to put the left operand into a scratch register. Another recursive call to order is 
made, with the tree consisting solely of the leaf a, and the cookie asking that the value be 
placed into a scratch register. This now matches a template, and a load instruction is emitted. 
The node consisting of a is rewritten in place to represent the register into which a is loaded, 
and this third call to order returns. The second call to order now finds that it has the tree 

reg + b 

to consider. Once again, there is no match, but the default rewriting rule rewrites the + as a 
+= operator, since the left operand is a scratch register. When this is done, there is a match: 
in fact, 

reg += b 

simply describes the effect of the add instruction on a typical machine. After the add is emit¬ 
ted, the tree is rewritten to consist merely of the register node, since the result of the add is 
now in the register. This agrees with the cookie passed to the second invocation of order, so 
this invocation terminates, returning to the first level. The original tree has now become 

a = reg 

which matches a template for the store instruction. The store is output, and the tree rewritten 
to become just a single register node. At this point, since the top level call to order was 
interested only in side effects, the call to order returns, and the code generation is completed; 
we have generated a load, add, and store, as might have been expected. 

The effect of machine architecture on this is considerable. For example, on the 
Honeywell 6000, the machine dependent heuristics recognize that there is an “add to storage” 
instruction, so the strategy is quite different; b is loaded into a register, and then an add to 
storage instruction generated to add this register into a. The transformations, involving as 
they do the semantics of C, are largely machine independent. The decisions as to when to use 
them, however, are almost totally machine dependent. 

Having given a broad outline of the code generation process, we shall next consider the 
heart of it: the templates. This leads naturally into discussions of template matching and 
register allocation, and finally a discussion of the machine dependent interfaces and strategies. 

The Templates 

The templates describe the effect of the target machine instructions on the model of com¬ 
putation around which the compiler is organized. In effect, each template has five logical sec¬ 
tions, and represents an assertion of the form: 

If we have a subtree of a given shape (1), and we have a goal (cookie) or goals to achieve 
(2), and we have sufficient free resources (3), then we may emit an instruction or instruc¬ 
tions (4), and rewrite the subtree in a particular manner (5), and the rewritten tree will 
achieve the desired goals. 

These five sections will be discussed in more detail later. First, we give an example of a 
template: 

ASG PLUS, INAREG, 
    SAREG, TINT, 
    SNAME, TINT, 
    0, RLEFT, 
    " add AL,AR\n", 

The top line specifies the operator (+=) and the cookie (compute the value of the subtree into 
an AREG). The second and third lines specify the left and right descendants, respectively, of 
the += operator. The left descendant must be a REG node, representing an A register, and 
have integer type, while the right side must be a NAME node, and also have integer type. The 
fourth line contains the resource requirements (no scratch registers or temporaries needed), 
and the rewriting rule (replace the subtree by the left descendant). Finally, the quoted string 
on the last line represents the output to the assembler: lower case letters, tabs, spaces, etc. are 
copied verbatim to the output; upper case letters trigger various macro-like expansions. Thus, 
AL would expand into the Address form of the Left operand, presumably the register 
number. Similarly, AR would expand into the name of the right operand. The add instruction 
of the last section might well be emitted by this template. 
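
The five sections of a template, and the macro-like expansion of its output string, can be sketched in C as below. The structure optab_sketch and its field names are assumptions that simply mirror the five sections described above, and expand_sketch handles only the AL and AR macros; it is not the compiler's expand routine.

```c
#include <stddef.h>
#include <string.h>

/* One template, with fields grouped by the five logical sections. */
struct optab_sketch {
    int op;                 /* (1) shape: operator at the root */
    int lshape, ltype;      /* (1) shape and type of left descendant */
    int rshape, rtype;      /* (1) shape and type of right descendant */
    int visit;              /* (2) cookie: acceptable goals */
    int needs;              /* (3) resources required */
    const char *cstring;    /* (4) output to the assembler */
    int rewrite;            /* (5) rewriting rule */
};

/* Copy tmpl to out, replacing AL by `left` and AR by `right`; all other
   characters are copied verbatim.  Returns the length written (not
   counting the terminating NUL), or -1 if out is too small. */
int expand_sketch(const char *tmpl, const char *left, const char *right,
                  char *out, size_t outsz)
{
    size_t n = 0;

    if (outsz == 0)
        return -1;
    while (*tmpl != '\0') {
        const char *sub = NULL;

        if (tmpl[0] == 'A' && tmpl[1] == 'L')
            sub = left;
        else if (tmpl[0] == 'A' && tmpl[1] == 'R')
            sub = right;
        if (sub != NULL) {
            size_t len = strlen(sub);
            if (n + len >= outsz)
                return -1;
            memcpy(out + n, sub, len);
            n += len;
            tmpl += 2;
        } else {
            if (n + 1 >= outsz)
                return -1;
            out[n++] = *tmpl++;
        }
    }
    out[n] = '\0';
    return (int)n;
}
```

With the operands given as the strings r3 and _b (both invented here), the template string of the example expands to an add instruction naming r3 and _b.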

In principle, it would be possible to make separate templates for all legal combinations of 
operators, cookies, types, and shapes. In practice, the number of combinations is very large. 
Thus, a considerable amount of mechanism is present to permit a large number of subtrees to 
be matched by a single template. Most of the shape and type specifiers are individual bits, and 
can be logically or’ed together. There are a number of special descriptors for matching classes 
of operators. The cookies can also be combined. As an example of the kind of template that 
really arises in practice, the actual template for the Interdata 8/32 that subsumes the above 
example is: 

ASG OPSIMP, INAREG|FORCC, 
    SAREG, TINT|TUNSIGNED|TPOINT, 
    SAREG|SNAME|SOREG|SCON, TINT|TUNSIGNED|TPOINT, 
    0, RLEFT|RESCC, 
    " OI AL,AR\n", 

Here, OPSIMP represents the operators +, -, |, &, and ^. The OI macro in the output string 
expands into the appropriate Integer Opcode for the operator. The left and right sides can be 
integers, unsigned, or pointer types. The right side can be, in addition to a name, a register, a 
memory location whose address is given by a register and displacement (OREG), or a constant. 
Finally, these instructions set the condition codes, and so can be used in condition contexts: 
the cookie and rewriting rules reflect this. 

The Template Matching Algorithm. 

The heart of the second pass is the template matching algorithm, in the routine match. 
Match is called with a tree and a cookie; it attempts to match the given tree against some template 
that will transform it according to one of the goals given in the cookie. If a match is successful, 
the transformation is applied; expand is called to generate the assembly code, and then 
reclaim rewrites the tree, and reclaims the resources, such as registers, that might have become 
free as a result of the generated code. 

This part of the compiler is among the most time critical. There is a spectrum of imple¬ 
mentation techniques available for doing this matching. The most naive algorithm simply 
looks at the templates one by one. This can be considerably improved upon by restricting the 
search for an acceptable template. It would be possible to do better than this if the templates 
were given to a separate program that ate them and generated a template matching subroutine. 
This would make maintenance of the compiler much more complicated, however, so this has 
not been done. 

The matching algorithm is actually carried out by restricting the range in the table that 
must be searched for each opcode. This introduces a number of complications, however, and 
needs a bit of sympathetic help by the person constructing the compiler in order to obtain best 
results. The exact tuning of this algorithm continues; it is best to consult the code and com¬ 
ments in match for the latest version. 

In order to match a template to a tree, it is necessary to match not only the cookie and 
the op of the root, but also the types and shapes of the left and right descendants (if any) of 
the tree. A convention is established here that is carried out throughout the second pass of the 
compiler. If a node represents a unary operator, the single descendant is always the “left” des¬ 
cendant. If a node represents a unary operator or a leaf node (no descendants) the “right” des¬ 
cendant is taken by convention to be the node itself. This enables templates to easily match 
leaves and conversion operators, for example, without any additional mechanism in the match¬ 
ing program. 
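
The descendant convention can be sketched in a few lines of C. This is only an illustration: the node layout and the names match_left and match_right are hypothetical, and treating a leaf as its own "left" descendant is an assumption that the text above does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout; the real second pass has its own node type. */
enum arity { LEAF, UNARY, BINARY };

struct node {
    enum arity kind;
    struct node *left;   /* NULL for a leaf */
    struct node *right;  /* NULL unless the operator is binary */
};

/* "Left" descendant handed to the shape and type matcher.
   Assumption: a leaf stands in for its own left side. */
struct node *match_left(struct node *p)
{
    assert(p != NULL);
    return p->kind == LEAF ? p : p->left;
}

/* "Right" descendant: the node itself unless the operator is binary. */
struct node *match_right(struct node *p)
{
    assert(p != NULL);
    return p->kind == BINARY ? p->right : p;
}
```

With helpers of this kind, a template for a conversion operator can describe its "right" operand without the matcher needing a separate case for unary nodes.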

The type matching is straightforward; it is possible to specify any combination of basic 
types, general pointers, and pointers to one or more of the basic types. The shape matching is 
somewhat more complicated, but still pretty simple. Templates have a collection of possible 
operand shapes on which the opcode might match. In the simplest case, an add operation 
might be able to add to either a register variable or a scratch register, and might be able (with 
appropriate help from the assembler) to add an integer constant (ICON), a static memory cell 
(NAME), or a stack location (OREG). 

It is usually attractive to specify a number of such shapes, and distinguish between them 
when the assembler output is produced. It is possible to describe the union of many elemen¬ 
tary shapes such as ICON, NAME, OREG, AREG or BREG (both scratch and register forms), 
etc. To handle at least the simple forms of indirection, one can also match some more compli¬ 
cated forms of trees; STARNM and STARREG can match more complicated trees headed by 
an indirection operator, and SFLD can match certain trees headed by a FLD operator: these 
patterns call machine dependent routines that match the patterns of interest on a given 
machine. The shape SWADD may be used to recognize NAME or OREG nodes that lie on 
word boundaries: this may be of some importance on word-addressed machines. Finally, there 
are some special shapes: these may not be used in conjunction with the other shapes, but may 
be defined and extended in machine dependent ways. The special shapes SZERO, SONE, and 
SMONE are predefined and match constants 0, 1, and -1, respectively; others are easy to add 
and match by using the machine dependent routine special. 
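
One plausible way to represent shapes is as a bit set, with the special shapes kept apart by a flag bit. The sketch below assumes this encoding; the bit values, the SPECIAL flag, the operand structure, and the function shape_ok are all hypothetical and are not taken from the compiler source.

```c
#include <assert.h>

/* Hypothetical bit values for the elementary shapes. */
#define AREG    001
#define BREG    002
#define ICON    004
#define NAME    010
#define OREG    020

/* Hypothetical flag marking a special shape; special shapes are
   never or'ed together with the elementary ones. */
#define SPECIAL 0100
#define SZERO   (SPECIAL | 1)
#define SONE    (SPECIAL | 2)
#define SMONE   (SPECIAL | 3)

struct operand {
    int shape;    /* exactly one elementary shape bit */
    long value;   /* meaningful only when shape is ICON */
};

/* Return nonzero if the operand satisfies the shape a template asks for. */
int shape_ok(const struct operand *o, int wanted)
{
    if (wanted & SPECIAL) {
        if (o->shape != ICON)
            return 0;
        switch (wanted) {
        case SZERO: return o->value == 0;
        case SONE:  return o->value == 1;
        case SMONE: return o->value == -1;
        default:    return 0;  /* a machine dependent hook would go here */
        }
    }
    return (o->shape & wanted) != 0;  /* union of elementary shapes */
}
```

A template that accepts "ICON, NAME, or OREG" would pass ICON | NAME | OREG as the wanted shape, while a template for an increment instruction might ask for SONE alone.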

When a template has been found that matches the root of the tree, the cookie, and the 
shapes and types of the descendants, there is still one bar to a total match: the template may 
call for some resources (for example, a scratch register). The routine allo is called, and it
attempts to allocate the resources. If it cannot, the match fails; no resources are allocated. If 
successful, the allocated resources are given numbers 1, 2, etc. for later reference when the 
assembly code is generated. The routines expand and reclaim are then called. The match rou¬ 
tine then returns a special value, MDONE. If no match was found, the value MNOPE is 
returned; this is a signal to the caller to try more cookie values, or attempt a rewriting rule. 
Match is also used to select rewriting rules, although the way of doing this is pretty straightfor¬ 
ward. A special cookie, FORREW, is used to ask match to search for a rewriting rule. The 
rewriting rules are keyed to various opcodes; most are carried out in order. Since the question 
of when to rewrite is one of the key issues in code generation, it will be taken up again later.
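
The idea of searching only the part of the table that belongs to one opcode can be sketched as follows. This is a deliberately reduced illustration: it assumes the table is sorted so that each opcode owns a contiguous range, it checks only the cookie, and the structure tmpl, the function match_sketch, and the numeric values given to MDONE and MNOPE are hypothetical. The real match routine also checks types and shapes and allocates resources.

```c
#include <assert.h>
#include <stddef.h>

#define MDONE (-1)   /* hypothetical value: a template matched */
#define MNOPE (-2)   /* hypothetical value: no template matched */

struct tmpl {
    int op;      /* opcode this template applies to */
    int goals;   /* bit set of cookies the template can satisfy */
};

/* Search only the run of entries for the given opcode.
   Assumes all entries for one opcode are adjacent in the table. */
int match_sketch(const struct tmpl *tab, size_t n, int op, int cookie,
                 size_t *hit)
{
    size_t i = 0;

    while (i < n && tab[i].op != op)   /* skip to the opcode's range */
        i++;
    for (; i < n && tab[i].op == op; i++) {
        if (tab[i].goals & cookie) {
            *hit = i;                  /* index of the matching template */
            return MDONE;
        }
    }
    return MNOPE;   /* caller may try another cookie or a rewriting rule */
}
```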


Register Allocation. 

The register allocation routines, and the allocation strategy, play a central role in the 
correctness of the code generation algorithm. If there are bugs in the Sethi-Ullman computa¬ 
tion that cause the number of needed registers to be underestimated, the compiler may run out 
of scratch registers; it is essential that the allocator keep track of those registers that are free 
and busy, in order to detect such conditions. 

Allocation of registers takes place as the result of a template match; the routine allo is
called with a word describing the number of A registers, B registers, and temporary locations 
needed. The allocation of temporary locations on the stack is relatively straightforward, and 
will not be further covered; the bookkeeping is a bit tricky, but conceptually trivial, and 
requests for temporary space on the stack will never fail. 

Register allocation is less straightforward. The two major complications are pairing and 
sharing. In many machines, some operations (such as multiplication and division), and/or 
some types (such as longs or double precision) require even/odd pairs of registers. Operations 
of the first type are exceptionally difficult to deal with in the compiler; in fact, their theoretical 
properties are rather bad as well. 9 The second issue is dealt with rather more successfully; a 
machine dependent function called szty(t) is called that returns 1 or 2, depending on the 
number of A registers required to hold an object of type t. If szty returns 2, an even/odd pair 
of A registers is allocated for each request. 
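
A minimal sketch of how szty might drive the allocation of even/odd pairs is given below. The register count, the type names, the busy table, and the function alloc_areg are assumptions made for the illustration; the choice of which types need two registers is machine dependent.

```c
#include <assert.h>

#define NREGS 6   /* hypothetical number of A scratch registers */

enum ctype { T_CHAR, T_INT, T_LONG, T_DOUBLE };   /* illustrative types */

static int busy[NREGS];   /* nonzero while a register is in use */

/* Illustrative szty: assumes long and double occupy two registers. */
int szty(enum ctype t)
{
    return (t == T_LONG || t == T_DOUBLE) ? 2 : 1;
}

/* Allocate one register, or an even/odd pair, for an object of type t.
   Returns the first register number, or -1 if the request fails;
   nothing is marked busy on failure. */
int alloc_areg(enum ctype t)
{
    int n = szty(t);

    /* Stepping by n makes every pair start on an even register. */
    for (int r = 0; r + n <= NREGS; r += n) {
        if (busy[r] || (n == 2 && busy[r + 1]))
            continue;
        busy[r] = 1;
        if (n == 2)
            busy[r + 1] = 1;
        return r;
    }
    return -1;
}
```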

The other issue, sharing, is more subtle, but important for good code quality. When 
registers are allocated, it is possible to reuse registers that hold address information, and use 
them to contain the values computed or accessed. For example, on the IBM 360, if register 2 
has a pointer to an integer in it, we may load the integer into register 2 itself by saying: 

L 2,0(2) 

If register 2 had a byte pointer, however, the sequence for loading a character involves clearing 
the target register first, and then inserting the desired character: 

SR 3,3 
IC 3,0(2) 

In the first case, if register 3 were used as the target, it would lead to a larger number of regis¬ 
ters used for the expression than were required; the compiler would generate inefficient code. 
On the other hand, if register 2 were used as the target in the second case, the code would sim¬ 
ply be wrong. In the first case, register 2 can be shared while in the second, it cannot. 

In the specification of the register needs in the templates, it is possible to indicate 
whether required scratch registers may be shared with possible registers on the left or the right 
of the input tree. In order that a register be shared, it must be scratch, and it must be used 
only once, on the appropriate side of the tree being compiled. 

The allo routine thus has a bit more to do than meets the eye; it calls freereg to obtain a
free register for each A and B register request. Freereg makes multiple calls on the routine 
usable to decide if a given register can be used to satisfy a given need. Usable calls shareit if 
the register is busy, but might be shared. Finally, shareit calls ushare to decide if the desired 
register is actually in the appropriate subtree, and can be shared. 

Just to add additional complexity, on some machines (such as the IBM 370) it is possible 
to have “double indexing” forms of addressing; these are represented by OREGs with the
base and index registers encoded into the register field. While the register allocation and deal¬ 
location per se is not made more difficult by this phenomenon, the code itself is somewhat 
more complex. 

Having allocated the registers and expanded the assembly language, it is time to reclaim 
the resources; the routine reclaim does this. Many operations produce more than one result. 
For example, many arithmetic operations may produce a value in a register, and also set the
condition codes. Assignment operations may leave results both in a register and in memory. 
Reclaim is passed three parameters; the tree and cookie that were matched, and the rewriting 
field of the template. The rewriting field allows the specification of possible results; the tree is 
rewritten to reflect the results of the operation. If the tree was computed for side effects only 
(FOREFF), the tree is freed, and all resources in it reclaimed. If the tree was computed for 
condition codes, the resources are also freed, and the tree replaced by a special node type, 
FORCC. Otherwise, the value may be found in the left argument of the root, the right argu¬ 
ment of the root, or one of the temporary resources allocated. In these cases, first the 
resources of the tree, and the newly allocated resources, are freed; then the resources needed by 
the result are made busy again. The final result must always match the shape of the input 
cookie; otherwise, the compiler error “cannot reclaim” is generated. There are some machine 
dependent ways of preferring results in registers or memory when there are multiple results 
matching multiple goals in the cookie. 

The Machine Dependent Interface 

The files order.c, local2.c, and table.c, as well as the header file mac2defs, represent the 
machine dependent portion of the second pass. The machine dependent portion can be 
roughly divided into two: the easy portion and the hard portion. The easy portion tells the 
compiler the names of the registers, and arranges that the compiler generate the proper assem¬ 
bler formats, opcode names, location counters, etc. The hard portion involves the 
Sethi-Ullman computation, the rewriting rules, and, to some extent, the templates. It is hard 
because there are no real algorithms that apply; most of this portion is based on heuristics. 
This section discusses the easy portion; the next several sections will discuss the hard portion. 

If the compiler is adapted from a compiler for a machine of similar architecture, the easy 
part is indeed easy. In mac2defs, the register numbers are defined, as well as various parame¬ 
ters for the stack frame, and various macros that describe the machine architecture. If double 
indexing is to be permitted, for example, the symbol R2REGS is defined. Also, a number of 
macros that are involved in function call processing, especially for unusual function call 
mechanisms, are defined here. 

In local2.c, a large number of simple functions are defined. These do things such as write 
out opcodes, register names, and address forms for the assembler. Part of the function call 
code is defined here; that is nontrivial to design, but typically rather straightforward to imple¬ 
ment. Among the easy routines in order.c are routines for generating a created label, defining a 
label, and generating the arguments of a function call. 

These routines tend to have a local effect, and depend in a fairly straightforward way on
the target assembler and the design decisions already made about the compiler. Thus they will 
not be further treated here. 

The Rewriting Rules 

When a tree fails to match any template, it becomes a candidate for rewriting. Before 
the tree is rewritten, the machine dependent routine nextcook is called with the tree and the 
cookie; it suggests another cookie that might be a better candidate for the matching of the tree. 
If all else fails, the templates are searched with the cookie FORREW, to look for a rewriting 
rule. The rewriting rules are of two kinds; for most of the common operators, there are 
machine dependent rewriting rules that may be applied; these are handled by machine depen¬ 
dent functions that are called and given the tree to be computed. These routines may recur¬ 
sively call order or codgen to cause certain subgoals to be achieved; if they actually call for 
some alteration of the tree, they return 1, and the code generation algorithm recanonicalizes 
and tries again. If these routines choose not to deal with the tree, the default rewriting rules 
are applied. 

The assignment ops, when rewritten, call the routine setasg. This is assumed to rewrite 
the tree at least to the point where there are no side effects in the left hand side. If there is 
still no template match, a default rewriting is done that causes an expression such as
a += b

to be rewritten as 
a = a + b 

This is a useful default for certain mixtures of strange types (for example, when a is a bit field 
and b a character) that otherwise might need separate table entries.

Simple assignment, structure assignment, and all forms of calls are handled completely by 
the machine dependent routines. For historical reasons, the routines generating the calls 
return 1 on failure, 0 on success, unlike the other routines. 

The machine dependent routine setbin handles binary operators; it too must do most of 
the job. In particular, when it returns 0, it must do so with the left hand side in a temporary 
register. The default rewriting rule in this case is to convert the binary operator into the asso¬ 
ciated assignment operator; since the left hand side is assumed to be a temporary register, this 
preserves the semantics and often allows a considerable saving in the template table. 

The increment and decrement operators may be dealt with by the machine dependent
routine setincr. If this routine chooses not to deal with the tree, the rewriting rule replaces 

x ++ 

by 

((x += 1) - 1) 

which preserves the semantics. Once again, this is not too attractive for the most common 
cases, but can generate close to optimal code when the type of x is unusual. 
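
For a plain int, the equivalence behind this default rule can be checked directly in C. The two function names below are hypothetical and exist only for the comparison; the sketch says nothing about how the compiler itself represents the rewritten tree.

```c
#include <assert.h>

/* x++ as the language defines it: yield the old value, then increment. */
int post_inc_direct(int *x)
{
    assert(x != 0);
    return (*x)++;
}

/* The default rewriting: increment first, then subtract 1 from the
   result to recover the old value. */
int post_inc_rewritten(int *x)
{
    assert(x != 0);
    return ((*x += 1) - 1);
}
```

Both forms leave x one larger and yield the value x had beforehand, which is why the rewriting can stand in for a missing template.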

Finally, the indirection (UNARY MUL) operator is also handled in a special way. The 
machine dependent routine offstar is extremely important for the efficient generation of code. 
Offstar is called with a tree that is the direct descendant of a UNARY MUL node; its job is to 
transform this tree so that the combination of UNARY MUL with the transformed tree 
becomes addressable. On most machines, offstar can simply compute the tree into an A or B 
register, depending on the architecture, and then canon will make the resulting tree into an 
OREG. On many machines, offstar can profitably choose to do less work than computing its 
entire argument into a register. For example, if the target machine supports OREGs with a
constant offset from a register, and offstar is called with a tree of the form 

expr + const 

where const is a constant, then offstar need only compute expr into the appropriate form of 
register. On machines that support double indexing, offstar may have even more choice as to 
how to proceed. The proper tuning of offstar, which is not typically too difficult, should be one 
of the first tries at optimization attempted by the compiler writer. 

The Sethi-Ullman Computation 

The heart of the heuristics is the computation of the Sethi-Ullman numbers. This com¬ 
putation is closely linked with the rewriting rules and the templates. As mentioned before, the 
Sethi-Ullman numbers are expected to estimate the number of scratch registers needed to com¬ 
pute the subtrees without using any stores. However, the original theory does not apply to real 
machines. For one thing, the theory assumes that all registers are interchangeable. Real 
machines have general purpose, floating point, and index registers, register pairs, etc. The 
theory also does not account for side effects; this rules out various forms of pathology that 
arise from assignment and assignment ops. Condition codes are also undreamed of. Finally, 
the influence of types, conversions, and the various addressability restrictions and extensions 
of real machines are also ignored. 

Nevertheless, for a “useless” theory, the basic insight of Sethi and Ullman is amazingly
useful in a real compiler. The notion that one should attempt to estimate the resource needs
of trees before starting the code generation provides a natural means of splitting the code
generation problem, and provides a bit of redundancy and self checking in the compiler.
Moreover, if writing the Sethi-Ullman routines is hard, describing, writing, and debugging the
alternative (routines that attempt to free up registers by stores into temporaries “on the fly”) is
even worse. Nevertheless, it should be clearly understood that these routines exist in a realm
where there is no “right” way to write them; it is an art, the realm of heuristics, and,
consequently, a major source of bugs in the compiler. Often, the early, crude versions of these
routines give little trouble; only after the compiler is actually working and the code quality is being
improved do serious problems have to be faced. Having a simple, regular machine architecture
is worth quite a lot at this time.

The major problems arise from asymmetries in the registers: register pairs, having 
different kinds of registers, and the related problem of needing more than one register (fre¬ 
quently a pair) to store certain data types (such as longs or doubles). There appears to be no 
general way of treating this problem; solutions have to be fudged for each machine where the
problem arises. On the Honeywell 66, for example, there are only two general purpose regis¬ 
ters, so a need for a pair is the same as the need for two registers. On the IBM 370, the regis¬ 
ter pair (0,1) is used to do multiplications and divisions; registers 0 and 1 are not generally 
considered part of the scratch registers, and so do not require allocation explicitly. On the 
Interdata 8/32, after much consideration, the decision was made not to try to deal with the 
register pair issue; operations such as multiplication and division that required pairs were sim¬ 
ply assumed to take all of the scratch registers. Several weeks of effort had failed to produce 
an algorithm that seemed to have much chance of running successfully without inordinate 
debugging effort. The difficulty of this issue should not be minimized; it represents one of the 
main intellectual efforts in porting the compiler. Nevertheless, this problem has been fudged 
with a degree of success on nearly a dozen machines, so the compiler writer should not aban¬ 
don hope. 

The Sethi-Ullman computations interact with the rest of the compiler in a number of 
rather subtle ways. As already discussed, the store routine uses the Sethi-Ullman numbers to 
decide which subtrees are too difficult to compute in registers, and must be stored. There are 
also subtle interactions between the rewriting routines and the Sethi-Ullman numbers. Sup¬ 
pose we have a tree such as 

A - B 

where A and B are expressions; suppose further that B takes two registers, and A one. It is 
possible to compute the full expression in two registers by first computing B, and then, using 
the scratch register used by B, but not containing the answer, compute A. The subtraction can 
then be done, computing the expression. (Note that this assumes a number of things, not the 
least of which are register-to-register subtraction operators and symmetric registers.) If the 
machine dependent routine setbin, however, is not prepared to recognize this case and compute 
the more difficult side of the expression first, the Sethi-Ullman number must be set to three. 
Thus, the Sethi-Ullman number for a tree should represent the code that the machine depen¬ 
dent routines are actually willing to generate. 
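
The A - B example can be worked through with a small sketch of the textbook Sethi-Ullman rule, extended with a flag that says whether the code generator is willing to compute the harder operand first. The structure expr, the function su_number, and the reorder flag are hypothetical; the sketch assumes interchangeable registers and binary operators only, which is exactly the idealization the surrounding text warns about.

```c
#include <assert.h>
#include <stddef.h>

struct expr {
    struct expr *left, *right;   /* both NULL for a leaf */
    int leaf_need;               /* registers a leaf is taken to need */
};

static int max2(int a, int b) { return a > b ? a : b; }

/* Estimate the scratch registers needed to compute e without stores.
   reorder != 0: the harder operand may be computed first.
   reorder == 0: the left operand is always computed first. */
int su_number(const struct expr *e, int reorder)
{
    assert(e != NULL);
    if (e->left == NULL && e->right == NULL)
        return e->leaf_need;

    int l = su_number(e->left, reorder);
    int r = su_number(e->right, reorder);

    if (!reorder)
        return max2(l, r + 1);   /* left result is held while right runs */
    return l == r ? l + 1 : max2(l, r);
}
```

Taking A and B as stand-ins that need one and two registers, the estimate is two when reordering is allowed and three when it is not, matching the figures in the text.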

The interaction can go the other way. If we take an expression such as 
*(p + i) 

where p is a pointer and i an integer, this can probably be done in one register on most 
machines. Thus, its Sethi-Ullman number would probably be set to one. If double indexing is 
possible in the machine, a possible way of computing the expression is to load both p and i 
into registers, and then use double indexing. This would use two scratch registers; in such a 
case, it is possible that the scratch registers might be unobtainable, or might make some other 
part of the computation run out of registers. The usual solution is to cause offstar to ignore 
opportunities for double indexing that would tie up more scratch registers than the Sethi- 
Ullman number had reserved.

In summary, the Sethi-Ullman computation represents much of the craftsmanship and 
artistry in any application of the portable compiler. It is also a frequent source of bugs. Algo¬ 
rithms are available that will produce nearly optimal code for specialized machines, but unfor¬ 
tunately most existing machines are far removed from these ideals. The best way of proceed¬ 
ing in practice is to start with a compiler for a similar machine to the target, and proceed very 
carefully. 

Register Allocation 

After the Sethi-Ullman numbers are computed, order calls a routine, rallo, that does 
register allocation, if appropriate. This routine does relatively little, in general; this is espe¬ 
cially true if the target machine is fairly regular. There are a few cases where it is assumed 
that the result of a computation takes place in a particular register; switch and function return 
are the two major places. The expression tree has a field, rall, that may be filled with a register
number; this is taken to be a preferred register, and the first temporary register allocated by
a template match will be this preferred one, if it is free. If not, no particular action is taken;
this is just a heuristic. If no register preference is present, the field contains NOPREF. In
some cases, the result must be placed in a given register, no matter what. The register number
is placed in rall, and the mask MUSTDO is logically or’ed in with it. In this case, if the subtree
is requested in a register, and comes back in a register other than the demanded one, it is
moved by calling the routine rmove. If the target register for this move is busy, it is a compiler 
error. 
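
The preference field might be decoded along the following lines. The numeric values chosen for MUSTDO and NOPREF, the enum action, and the function place_result are assumptions for the sake of the sketch; only the names MUSTDO and NOPREF and the overall behavior come from the description above.

```c
#include <assert.h>

#define MUSTDO 010000   /* hypothetical flag bit: the register is mandatory */
#define NOPREF 020000   /* hypothetical value: no register preference */

enum action { KEEP, MOVE };

/* Given the preference field and the register in which the result
   actually arrived, decide whether a register-to-register move is
   needed.  A plain preference is only a hint and never forces a move. */
enum action place_result(int rall, int got)
{
    assert(got >= 0);
    if (rall == NOPREF)
        return KEEP;
    if ((rall & MUSTDO) && (rall & ~MUSTDO) != got)
        return MOVE;   /* the real compiler would call rmove here */
    return KEEP;
}
```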

Note that this mechanism is the only one that will ever cause a register-to-register move 
between scratch registers (unless such a move is buried in the depths of some template). This 
simplifies debugging. In some cases, there is a rather strange interaction between the register 
allocation and the Sethi-Ullman number; if there is an operator or situation requiring a partic¬ 
ular register, the allocator and the Sethi-Ullman computation must conspire to ensure that the 
target register is not being used by some intermediate result of some far-removed computation. 
This is most easily done by making the special operation take all of the free registers, prevent¬ 
ing any other partially-computed results from cluttering up the works. 

Compiler Bugs 

The portable compiler has an excellent record of generating correct code. The require¬ 
ment for reasonable cooperation between the register allocation, Sethi-Ullman computation, 
rewriting rules, and templates builds quite a bit of redundancy into the compiling process. The 
effect of this is that, in a surprisingly short time, the compiler will start generating correct code 
for those programs that it can compile. The hard part of the job then becomes finding and 
eliminating those situations where the compiler refuses to compile a program because it knows 
it cannot do it right. For example, a template may simply be missing; this may either give a 
compiler error of the form “no match for op ...” , or cause the compiler to go into an infinite 
loop applying various rewriting rules. The compiler has a variable, nrecur, that is set to 0 at 
the beginning of an expression, and incremented at key spots in the compilation process; if
this parameter gets too large, the compiler decides that it is in a loop, and aborts. Loops are 
also characteristic of botches in the machine-dependent rewriting rules. Bad Sethi-Ullman 
computations usually cause the scratch registers to run out; this often means that the Sethi- 
Ullman number was underestimated, so store did not store something it should have; alterna¬ 
tively, it can mean that the rewriting rules were not smart enough to find the sequence that 
sucomp assumed would be used. 
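
The loop guard built on nrecur can be pictured as a simple counter. The limit, and the function names start_expression and rewrite_step, are hypothetical; the text gives neither the threshold nor the places where the counter is bumped.

```c
#include <assert.h>

#define NRECUR_LIMIT 100   /* hypothetical threshold */

static int nrecur;

/* Reset the counter at the beginning of each expression. */
void start_expression(void)
{
    nrecur = 0;
}

/* Called at a key spot in the compilation process.  Returns 0 once the
   limit is exceeded; at that point the real compiler reports an error
   and aborts instead of looping forever. */
int rewrite_step(void)
{
    assert(nrecur >= 0);
    return ++nrecur <= NRECUR_LIMIT;
}
```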

The best approach when a compiler error is detected involves several stages. First, try to 
get a small example program that steps on the bug. Second, turn on various debugging flags in 
the code generator, and follow the tree through the process of being matched and rewritten. 
Some flags of interest are -e, which prints the expression tree, -r, which gives information 
about the allocation of registers, -a, which gives information about the performance of rallo, 
and -o, which gives information about the behavior of order. This technique should allow
most bugs to be found relatively quickly.

Unfortunately, finding the bug is usually not enough; it must also be fixed! The difficulty 
arises because a fix to the particular bug of interest tends to break other code that already 
works. Regression tests, tests that compare the performance of a new compiler against the 
performance of an older one, are very valuable in preventing major catastrophes. 

Summary and Conclusion 

The portable compiler has been a useful tool for providing C capability on a large number 
of diverse machines, and for testing a number of theoretical constructs in a practical setting. 
It has many blemishes, both in style and functionality. It has been applied to many more 
machines than first anticipated, of a much wider range than originally dreamed of. Its use has 
also spread much faster than expected, leaving parts of the compiler still somewhat raw in 
shape. 

On the theoretical side, there is some hope that the skeleton of the sucomp routine could 
be generated for many machines directly from the templates; this would give a considerable 
boost to the portability and correctness of the compiler, but might affect tunability and code 
quality. There is also room for more optimization, both within optim and in the form of a 
portable “peephole” optimizer. 

On the practical, development side, the compiler could probably be sped up and made 
smaller without doing too much violence to its basic structure. Parts of the compiler deserve 
to be rewritten; the initialization code, register allocation, and parser are prime candidates. It 
might be that doing some or all of the parsing with a recursive descent parser might save 
enough space and time to be worthwhile; it would certainly ease the problem of moving the 
compiler to an environment where Yacc is not already present. 

Finally, I would like to thank the many people who have sympathetically, and even 
enthusiastically, helped me grapple with what has been a frustrating program to write, test, 
and install. D. M. Ritchie and E. N. Pinson provided needed early encouragement and philosophical
guidance; M. E. Lesk, R. Muha, T. G. Peterson, G. Riddle, L. Rosler, R. W. Mitze, B.
R. Rowland, S. I. Feldman, and T. B. London have all contributed ideas, gripes, and all, at one 
time or another, climbed “into the pits” with me to help debug. Without their help this effort 
would have not been possible; with it, it was often kind of fun. 




References 

1. B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, Engle¬ 
wood Cliffs, New Jersey, 1978. 

2. S. C. Johnson, “Lint, a C Program Checker,” Comp. Sci. Tech. Rep. No. 65, 1978. 
updated version TM 78-1273-3 

3. A. Snyder, A Portable Compiler for the Language C, Master’s Thesis, M.I.T., Cambridge, 
Mass., 1974. 

4. S. C. Johnson, “A Portable Compiler: Theory and Practice,” Proc. 5th ACM Symp. on 
Principles of Programming Languages, pp. 97-104, January 1978. 

5. M. E. Lesk, S. C. Johnson, and D. M. Ritchie, The C Language Calling Sequence, 1977. 

6. S. C. Johnson, “Yacc — Yet Another Compiler-Compiler,” Comp. Sci. Tech. Rep. No. 
32, Bell Laboratories, Murray Hill, New Jersey, July 1975. 

7. A. V. Aho and S. C. Johnson, “Optimal Code Generation for Expression Trees,” J. Assoc. 
Comp. Mach., vol. 23, no. 3, pp. 488-501, 1975. Also in Proc. ACM Symp. on Theory of 
Computing, pp. 207-217, 1975. 

8. R. Sethi and J. D. Ullman, “The Generation of Optimal Code for Arithmetic Expres¬ 
sions,” J. Assoc. Comp. Mach., vol. 17, no. 4, pp. 715-728, October 1970. Reprinted as pp. 
229-247 in Compiler Techniques, ed. B. W. Pollack, Auerbach, Princeton NJ (1972). 

9. A. V. Aho, S. C. Johnson, and J. D. Ullman, “Code Generation for Machines with Mul¬ 
tiregister Operations,” Proc. 4th ACM Symp. on Principles of Programming Languages, 
pp. 21-28, January 1977. 







A Tour through the UNIX C Compiler


D. M. Ritchie 

Languages 

Computing 


The Intermediate Language 

Communication between the two phases of the compiler proper is carried out by means of 
a pair of intermediate files. These files are treated as having identical structure, although the 
second file contains only the code generated for strings. It is convenient to write strings out 
separately to reduce the need for multiple location counters in a later assembly phase. 

The intermediate language is not machine-independent; its structure in a number of ways 
reflects the fact that C was originally a one-pass compiler chopped in two to reduce the max¬ 
imum memory requirement. In fact, only the latest version of the compiler has a complete 
intermediate language at all. Until recently, the first phase of the compiler generated assembly 
code for those constructions it could deal with, and passed expression parse trees, in absolute 
binary form, to the second phase for code generation. Now, at least, all inter-phase informa¬ 
tion is passed in a describable form, and there are no absolute pointers involved, so the cou¬ 
pling between the phases is not so strong. 

The areas in which the machine (and system) dependencies are most noticeable are 

1. Storage allocation for automatic variables and arguments has already been performed, 
and nodes for such variables refer to them by offset from a display pointer. Type conver¬ 
sion (for example, from integer to pointer) has already occurred using the assumption of 
byte addressing and 2-byte words. 

2. Data representations suitable to the PDP-11 are assumed; in particular, floating point 
constants are passed as four words in the machine representation. 

As it happens, each intermediate file is represented as a sequence of binary numbers 
without any explicit demarcations. It consists of a sequence of conceptual lines, each headed 
by an operator, and possibly containing various operands. The operators are small numbers; to 
assist in recognizing failure in synchronization, the high-order byte of each operator word is 
always the octal number 376. Operands are either 16-bit binary numbers or strings of charac¬ 
ters representing names. Each name is terminated by a null character. There is no alignment 
requirement for numerical operands and so there is no padding after a name string. 
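
A reader for this format could be sketched as below. The sketch works on a memory buffer rather than a file, assumes the low-order byte of each word comes first (PDP-11 order), and uses hypothetical names throughout (struct reader, get_word, get_op, get_name). Only the 0376 marker byte, the 16-bit operands, and the null-terminated, unpadded names come from the description above.

```c
#include <assert.h>
#include <stddef.h>

struct reader {
    const unsigned char *p;     /* next unread byte */
    const unsigned char *end;   /* one past the last byte */
};

/* Read one 16-bit word, low byte first (assumed byte order).
   Returns -1 if fewer than two bytes remain. */
long get_word(struct reader *r)
{
    assert(r != NULL);
    if (r->end - r->p < 2)
        return -1;
    long w = r->p[0] | (r->p[1] << 8);
    r->p += 2;
    return w;
}

/* Read an operator word.  Returns the operator number, or -1 at end of
   input or when the high-order byte is not 0376 (synchronization lost). */
int get_op(struct reader *r)
{
    long w = get_word(r);
    if (w < 0 || (w >> 8) != 0376)
        return -1;
    return (int)(w & 0377);
}

/* Read a null-terminated name.  Returns a pointer into the buffer, or
   NULL if no terminator is found.  No padding follows the name. */
const char *get_name(struct reader *r)
{
    const unsigned char *start = r->p;
    while (r->p < r->end && *r->p != '\0')
        r->p++;
    if (r->p == r->end)
        return NULL;
    r->p++;   /* step over the terminating null */
    return (const char *)start;
}
```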

The binary representation was chosen to avoid the necessity of converting to and from 
character form and to minimize the size of the files. It would be very easy to make each 
operator-operand ‘line’ in the file be a genuine, printable line, with the numbers in octal or 
decimal; this in fact was the representation originally used. 

The operators fall naturally into two classes: those which represent part of an expression, 
and all others. Expressions are transmitted in a reverse-Polish notation; as they are being 
read, a tree is built which is isomorphic to the tree constructed in the first phase. Expressions 
are passed as a whole, with no non-expression operators intervening. The reader maintains a 
stack; each leaf of the expression tree (name, constant) is pushed on the stack; each unary 
operator replaces the top of the stack by a node whose operand is the old top-of-stack; each 
binary operator replaces the top pair on the stack with a single entry. When the expression is
complete there is exactly one item on the stack. Following each expression is a special opera¬ 
tor which passes the unique previous expression to the ‘optimizer’ described below and then to 
the code generator. 
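
The stack discipline described above can be sketched as follows. The node type, the fixed-size pool and stack, and the functions rp_item and rp_finish are hypothetical; the sketch shows only how a reverse-Polish stream is turned back into a tree, not how operators are classified in the actual compiler.

```c
#include <assert.h>
#include <stddef.h>

enum kind { K_LEAF, K_UNARY, K_BINARY };

struct tnode {
    enum kind kind;
    int op;                      /* operator or leaf code */
    struct tnode *left, *right;
};

#define MAXSTACK 64
#define MAXNODES 256

static struct tnode pool[MAXNODES];   /* simple node pool */
static int npool;
static struct tnode *stack[MAXSTACK];
static int sp;

static struct tnode *new_node(enum kind k, int op,
                              struct tnode *l, struct tnode *r)
{
    assert(npool < MAXNODES);
    struct tnode *n = &pool[npool++];
    n->kind = k;
    n->op = op;
    n->left = l;
    n->right = r;
    return n;
}

/* Process one item of the reverse-Polish stream. */
void rp_item(enum kind k, int op)
{
    struct tnode *l = NULL, *r = NULL;

    if (k == K_BINARY) {          /* replaces the top pair */
        assert(sp >= 2);
        r = stack[--sp];
        l = stack[--sp];
    } else if (k == K_UNARY) {    /* replaces the top of the stack */
        assert(sp >= 1);
        l = stack[--sp];
    }
    assert(sp < MAXSTACK);
    stack[sp++] = new_node(k, op, l, r);   /* a leaf is simply pushed */
}

/* At the end of an expression exactly one item must remain. */
struct tnode *rp_finish(void)
{
    assert(sp == 1);
    return stack[--sp];
}
```

Feeding the items a, b, +, and a unary minus in that order yields a tree whose root is the unary minus and whose single operand is the sum.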

Here is the list of operators not themselves part of expressions. 


EOF 

marks the end of an input file. 

BDATA flag data ... 

specifies a sequence of bytes to be assembled as static data. It is followed by pairs of 
words; the first member of the pair is non-zero to indicate that the data continue; a zero 
flag is not followed by data and terminates the operator. The data bytes occupy the low- 
order part of a word. 

WDATA flag data ... 

specifies a sequence of words to be assembled as static data; it is identical to the BDATA 
operator except that entire words, not just bytes, are passed. 

PROG 

means that subsequent information is to be compiled as program text. 

DATA 

means that subsequent information is to be compiled as static data. 

BSS 

means that subsequent information is to be compiled as uninitialized static data.

SYMDEF name 

means that the symbol name is an external name defined in the current program. It is 
produced for each external data or function definition. 

CSPACE name size 

indicates that the name refers to a data area whose size is the specified number of bytes. 
It is produced for external data definitions without explicit initialization. 

SSPACE size 

indicates that size bytes should be set aside for data storage. It is used to pad out short 
initializations of external data and to reserve space for static (internal) data. It will be 
preceded by an appropriate label. 

EVEN 

is produced after each external data definition whose size is not an integral number of 
words. It is not produced after strings except when they initialize a character array. 

NLABEL name 

is produced just before a BDATA or WDATA initializing external data, and serves as a
label for the data.

RLABEL name

is produced just before each function definition, and labels its entry point. 

SNAME name number 

is produced at the start of each function for each static variable or label declared therein. 
Subsequent uses of the variable will be in terms of the given number. The code generator 
uses this only to produce a debugging symbol table. 

ANAME name number 

Likewise, each automatic variable’s name and stack offset is specified by this operator. 
Arguments count as automatics. 

RNAME name number 

Each register variable is similarly named, with its register number. 

SAVE number 

produces a register-save sequence at the start of each function, just after its label
(RLABEL).

SETREG number 

is used to indicate the number of registers used for register variables. It actually gives 
the register number of the lowest free register; it is redundant because the RNAME 
operators could be counted instead. 

PROFIL 

is produced before the save sequence for functions when the profile option is turned on. 
It produces code to count the number of times the function is called. 

SWIT deflab line label value ... 

is produced for switches. When control flows into it, the value being switched on is in
the register forced by RFORCE (below). The switch statement occurred on the indicated
line of the source, and the label number of the default location is deflab. The operator
is then followed by a sequence of label-number and value pairs; the list is terminated by a 0
label.

LABEL number 

generates an internal label. It is referred to elsewhere using the given number. 

BRANCH number 

indicates an unconditional transfer to the internal label number given. 

RETRN 

produces the return sequence for a function. It occurs only once, at the end of each
function.

EXPR line 

causes the expression just preceding to be compiled. The argument is the line number in 
the source where the expression occurred.

NAME class type name


NAME class type number 

indicates a name occurring in an expression. The first form is used when the name is 
external; the second when the name is automatic, static, or a register. Then the number 
indicates the stack offset, the label number, or the register number as appropriate. Class 
and type encoding is described elsewhere. 

CON type value 

transmits an integer constant. This and the next two operators occur as part of
expressions.

FCON type 4-word-value 

transmits a floating constant as four words in PDP-11 notation. 

SFCON type value 

transmits a floating-point constant whose value is correctly represented by its high-order 
word in PDP-11 notation. 

NULL 

indicates a null argument list of a function call in an expression; call is a binary operator 
whose second operand is the argument list. 

CBRANCH label cond 

produces a conditional branch. It is an expression operator, and will be followed by an 
EXPR. The branch to the label number takes place if the expression’s truth value is the 
same as that of cond. That is, if cond=1 and the expression evaluates to true, the
branch is taken. 

binary-operator type 

There are binary operators corresponding to each such source-language operator; the type
of the result of each is passed as well. Some perhaps-unexpected ones are: COMMA,
which is a right-associative operator designed to simplify right-to-left evaluation of function
arguments; prefix and postfix ++ and --, whose second operand is the increment
amount, as a CON; QUEST and COLON, to express the conditional expression as
‘a?(b:c)’; and a sequence of special operators for expressing relations between pointers, in
case pointer comparison is different from integer comparison (e.g. unsigned).

unary-operator type 

There are also numerous unary operators. These include ITOF, FTOI, FTOL, LTOF, 
ITOL, LTOI which convert among floating, long, and integer; JUMP which branches 
indirectly through a label expression; INIT, which compiles the value of a constant 
expression used as an initializer; RFORCE, which is used before a return sequence or a 
switch to place a value in an agreed-upon register. 

Expression Optimization 

Each expression tree, as it is read in, is subjected to a fairly comprehensive analysis. 
This is performed by the optim routine and a number of subroutines; the major things done 
are 

1. Modifications and simplifications of the tree so its value may be computed more 
efficiently and conveniently by the code generator.

2. Marking each interior node with an estimate of the number of registers required to evaluate
it. This register count is needed to guide the code generation algorithm.

One thing that is definitely not done is discovery or exploitation of common subexpressions,
nor is this done anywhere in the compiler.

The basic organization is simple: a depth-first scan of the tree. Optim does nothing for 
leaf nodes (except for automatics; see below), and calls unoptim to handle unary operators. 
For binary operators, it calls itself to process the operands, then treats each operator 
separately. One important case is commutative and associative operators, which are handled 
by acommute. 

Here is a brief catalog of the transformations carried out by optim itself. It is not
intended to be complete. Some of the transformations are machine-dependent, although they 
may well be useful on machines other than the PDP-11. 

1. As indicated in the discussion of unoptim below, the optimizer can create a node type 
corresponding to the location addressed by a register plus a constant offset. Since this is 
precisely the implementation of automatic variables and arguments, where the register is 
fixed by convention, such variables are changed to the new form to simplify later
processing.

2. Associative and commutative operators are processed by the special routine acommute. 

3. After processing by acommute, the bitwise & operator is turned into a new andn operator;
‘a & b’ becomes ‘a andn ~b’. This is done because the PDP-11 provides no and operator,
but only andn. A similar transformation takes place for ‘=&’.

4. Relationals are turned around so the more complicated expression is on the left. (So that
‘2 > f(x)’ becomes ‘f(x) < 2’.) This improves code generation since the algorithm prefers
to have the right operand require fewer registers than the left.

5. An expression minus a constant is turned into the expression plus the negative constant, 
and the acommute routine is called to take advantage of the properties of addition. 

6. Operators with constant operands are evaluated. 

7. Right shifts (unless by 1) are turned into left shifts with a negated right operand, since 
the PDP-11 lacks a general right-shift operator. 

8. A number of special cases are simplified, such as division or multiplication by 1, and 
shifts by 0. 
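
Two of the rewrites in this catalog (items 5 and 7) can be sketched as follows. The node structure and the function name rewrite are hypothetical, and the sketch handles only constant right operands, whereas the text allows a general right operand for shifts.

```c
#include <assert.h>
#include <stddef.h>

enum op { CON, NAME, PLUS, MINUS, LSHIFT, RSHIFT };

struct node {
    enum op op;
    int value;                 /* meaningful only when op == CON */
    struct node *left, *right;
};

/* Apply transformations 5 and 7 to a single binary node, in place.
   Only constant right operands are handled in this sketch. */
void rewrite(struct node *n)
{
    assert(n != NULL);
    if (n->right == NULL || n->right->op != CON)
        return;
    if (n->op == MINUS) {
        /* x - c  becomes  x + (-c) */
        n->op = PLUS;
        n->right->value = -n->right->value;
    } else if (n->op == RSHIFT && n->right->value != 1) {
        /* x >> c  becomes  x << (-c), except for shifts by 1 */
        n->op = LSHIFT;
        n->right->value = -n->right->value;
    }
}
```

After the first rewrite the node is a PLUS and is eligible for the commutative processing described under acommute; a right shift by 1 is deliberately left alone.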

The unoptim routine performs the same sort of processing for unary operators. 

1. ‘*&x’ and ‘&*x’ are simplified to ‘x’.

2. If r is a register and c is a constant or the address of a static or external variable, the
expressions ‘*(r+c)’ and ‘*r’ are turned into a special kind of name node which expresses
the name itself and the offset. This simplifies subsequent processing because such constructions
can appear as the address of a PDP-11 instruction.

3. When the unary ‘&’ operator is applied to a name node of the special kind just discussed,
it is reworked to make the addition explicit again; this is done because the PDP-11 has 
no ‘load address’ instruction. 

4. Constructions like ‘*r++’ and ‘*--r’ where r is a register are discovered and marked as
being implementable using the PDP-11 auto-increment and -decrement modes. 

5. If ‘!’ is applied to a relational, the ‘!’ is discarded and the sense of the relational is
reversed.

6. Special cases involving reflexive use of negation and complementation are discovered. 

7. Operations applying to constants are evaluated. 

The acommute routine, called for associative and commutative operators, discovers clusters
of the same operator at the top levels of the current tree, and arranges them in a list: for
‘a+((b+c)+(d+f))’ the list would be ‘a,b,c,d,f’. After each subtree is optimized, the list is
sorted in decreasing difficulty of computation; as mentioned above, the code generation algorithm
works best when left operands are the difficult ones. The ‘degree of difficulty’ computed
is actually finer than the mere number of registers required; a constant is considered simpler
than the address of a static or external, which is simpler than reference to a variable. This
makes it easy to fold all the constants together, and also to merge together the sum of a constant
and the address of a static or external (since in such nodes there is space for an ‘offset’
value). There are also special cases, like multiplication by 1 and addition of 0.

A special routine is invoked to handle sums of products. Distrib is based on the fact that it is
better to compute ‘c1*c2*x + c1*y’ as ‘c1*(c2*x + y)’ and makes the divisibility tests required
to assure the correctness of the transformation. This transformation is rarely possible with 
code directly written by the user, but it invariably occurs as a result of the implementation of 
multi-dimensional arrays. 
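
As an illustration of where such sums arise, consider the byte offset of a[i][j] in a two-dimensional array of 2-byte integers. The two functions below are a hypothetical sketch (the names and the 2-byte element size are assumptions for the example); they compute the same offset before and after the common constant is factored out.

```c
#include <assert.h>

/* Byte offset of a[i][j] for an array with ncols 2-byte elements per row,
   written as the sum of products that subscripting naturally produces.
   Here c1 = 2, c2 = ncols, x = i, y = j. */
long offset_naive(long i, long j, long ncols)
{
    return (2 * ncols) * i + 2 * j;      /* c1*c2*x + c1*y */
}

/* The same offset after the common constant c1 = 2 is factored out,
   which saves one multiplication. */
long offset_factored(long i, long j, long ncols)
{
    return 2 * (ncols * i + j);          /* c1*(c2*x + y) */
}
```

The factoring is valid here because the coefficient of i is an exact multiple of the coefficient of j, which is the kind of divisibility test mentioned above.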

Finally, acommute reconstructs a tree from the list of expressions which result. 
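
The flatten, sort, and fold steps can be sketched in C for the ‘+’ operator. This is a simplified illustration, not the routine itself: the node layout, the names flatten and cluster, and the two-level difficulty measure (constants simplest, then register count) are assumptions, and the final tree reconstruction is omitted.

```c
#include <assert.h>
#include <stddef.h>

enum op { CON, NAME, PLUS };

struct node {
    enum op op;
    int value;      /* constant value when op == CON */
    int regs;       /* register estimate for non-constants */
    struct node *left, *right;
};

#define MAXLIST 32

/* Constants are always the simplest operands in this sketch. */
static int difficulty(const struct node *n)
{
    return n->op == CON ? -1 : n->regs;
}

/* Collect the operands of a cluster of PLUS nodes, left to right. */
static int flatten(struct node *t, struct node **list, int n)
{
    if (t->op == PLUS) {
        n = flatten(t->left, list, n);
        return flatten(t->right, list, n);
    }
    assert(n < MAXLIST);
    list[n++] = t;
    return n;
}

/* Fill list with the operands, hardest first, with all constants folded
   into one trailing entry. Returns the resulting list length. */
int cluster(struct node *t, struct node **list)
{
    int n = flatten(t, list, 0);

    /* insertion sort by decreasing difficulty; equal items keep their order */
    for (int i = 1; i < n; i++) {
        struct node *key = list[i];
        int j = i - 1;
        while (j >= 0 && difficulty(list[j]) < difficulty(key)) {
            list[j + 1] = list[j];
            j--;
        }
        list[j + 1] = key;
    }

    /* the constants now sit together at the tail; merge them */
    while (n >= 2 && list[n - 1]->op == CON && list[n - 2]->op == CON) {
        list[n - 2]->value += list[n - 1]->value;
        n--;
    }
    return n;
}
```

For ‘2 + (x + 3)’ the list comes back as x followed by the single constant 5.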

Code Generation 

The grand plan for code-generation is independent of any particular machine; it depends 
largely on a set of tables. But this fact does not necessarily make it very easy to modify the 
compiler to produce code for other machines, both because there is a good deal of machine- 
dependent structure in the tables, and because in any event such tables are non-trivial to 
prepare. 

The arguments to the basic code generation routine rcexpr are a pointer to a tree 
representing an expression, the name of a code-generation table, and the number of a register 
in which the value of the expression should be placed. Rcexpr returns the number of the register
in which the value actually ended up; its caller may need to produce a mov instruction if
the value really needs to be in the given register. There are four code generation tables. 

Regtab is the basic one, which actually does the job described above: namely, compile 
code which places the value represented by the expression tree in a register. 

Cctab is used when the value of the expression is not actually needed, but instead the 
value of the condition codes resulting from evaluation of the expression. This table is used, for 
example, to evaluate the expression after if. It is clearly silly to calculate the value (0 or 1) of 
the expression ‘a==b’ in the context ‘if (a==b) ... ’

The sptab table is used when the value of an expression is to be pushed on the stack, for 
example when it is an actual argument. For example in the function call f(a) it is a bad idea 
to load a into a register which is then pushed on the stack, when there is a single instruction 
which does the job. 

The efftab table is used when an expression is to be evaluated for its side effects, not its 
value. This occurs mostly for expressions which are statements, which have no value. Thus 
the code for the statement ‘a = b’ need produce only the appropriate mov instruction, and
need not leave the value of b in a register, while in the expression ‘a + (b = c)’ the value of
‘b = c’ will appear in a register.

All of the tables besides regtab are rather small, and handle only a relatively few special 
cases. If one of these subsidiary tables does not contain an entry applicable to the given 
expression tree, rcexpr uses regtab to put the value of the expression into a register and then 
fixes things up; nothing need be done when the table was efftab, but a tst instruction is produced
when the table called for was cctab, and a mov instruction, pushing the register on the
stack, when the table was sptab.
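
The fix-up step can be sketched as a small C function. The function name, the enum, and the exact assembler spelling of the emitted instructions are assumptions made for the illustration; the text specifies only which kind of instruction follows a fallback to regtab.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum table { REGTAB, CCTAB, SPTAB, EFFTAB };

/* Write into out the instruction needed after regtab has been used in
   place of the table originally wanted; reg holds the value. */
void fixup_after_regtab(enum table wanted, int reg, char *out, size_t outsize)
{
    assert(reg >= 0 && reg <= 4 && outsize > 0);
    switch (wanted) {
    case CCTAB:                 /* set the condition codes from the value */
        snprintf(out, outsize, "tst r%d", reg);
        break;
    case SPTAB:                 /* push the register on the stack */
        snprintf(out, outsize, "mov r%d,-(sp)", reg);
        break;
    default:                    /* efftab (or regtab itself): nothing to do */
        out[0] = '\0';
        break;
    }
}
```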

The rcexpr routine itself picks off some special cases, then calls cexpr to do the real work.
Cexpr tries to find an entry applicable to the given tree in the given table, and returns -1 if no
such entry is found, letting rcexpr try again with a different table. A successful match yields a
string containing both literal characters which are written out and pseudo-operations, or macros,
which are expanded. Before studying the contents of these strings we will consider how
table entries are matched against trees.

Recall that most non-leaf nodes in an expression tree contain the name of the operator, 
the type of the value represented, and pointers to the subtrees (operands). They also contain 
an estimate of the number of registers required to evaluate the expression, placed there by the 
expression-optimizer routines. The register counts are used to guide the code generation process,
which is based on the Sethi-Ullman algorithm.
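
For readers unfamiliar with it, the textbook form of Sethi-Ullman labelling is sketched below. This is the classical algorithm for a machine whose instructions accept a memory operand, not a transcription of the compiler's own counting routine, which (as described later) makes special allowances for operations such as multiply and divide. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct tree {
    struct tree *left, *right;  /* both NULL for a leaf; right NULL for a unary */
    int regs;                   /* register estimate, filled in by label() */
};

/* Label t and its subtrees; is_right says whether t is a right operand. */
int label(struct tree *t, int is_right)
{
    assert(t != NULL);
    if (t->left == NULL && t->right == NULL) {
        /* a right-hand leaf can be used directly as an instruction operand */
        t->regs = is_right ? 0 : 1;
    } else if (t->right == NULL) {
        /* a unary operator works in the registers of its operand */
        t->regs = label(t->left, 0);
    } else {
        int l = label(t->left, 0);
        int r = label(t->right, 1);
        /* equal needs force one extra register; otherwise the larger wins */
        t->regs = (l == r) ? l + 1 : (l > r ? l : r);
    }
    return t->regs;
}
```

Under this labelling ‘a+b’ needs one register and ‘(a+b)+(c+d)’ needs two.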

The main code generation tables consist of entries each containing an operator number 
and a pointer to a subtable for the corresponding operator. A subtable consists of a sequence 
of entries, each with a key describing certain properties of the operands of the operator 
involved; associated with the key is a code string. Once the subtable corresponding to the 
operator is found, the subtable is searched linearly until a key is found such that the properties 
demanded by the key are compatible with the operands of the tree node. A successful match 
returns the code string; an unsuccessful search, either for the operator in the main table or a 
compatible key in the subtable, returns a failure indication.

The tables are all contained in a file which must be processed to obtain an assembly 
language program. Thus they are written in a special-purpose language. To provide
definiteness to the following discussion, here is an example of a subtable entry.

%n,aw
    F
    add A2,R

The ‘%’ indicates the key; the information following (up to a blank line) specifies the code 
string. Very briefly, this entry is in the subtable for ‘+’ of regtab; the key specifies that the left 
operand is any integer, character, or pointer expression, and the right operand is any word 
quantity which is directly addressible (e.g. a variable or constant). The code string calls for the 
generation of the code to compile the left (first) operand into the current register (‘F’) and then 
to produce an ‘add’ instruction which adds the second operand (‘A2’) to the register (‘R’). All 
of the notation will be explained below. 

Only three features of the operands are used in deciding whether a match has occurred. 
They are: 

1. Is the type of the operand compatible with that demanded? 

2. Is the ‘degree of difficulty’ (in a sense described below) compatible? 

3. The table may demand that the operand have a ‘*’ (indirection operator) as its highest
operator.

As suggested above, the key for a subtable entry is indicated by a ‘%’ and a comma-separated
pair of specifications for the operands. (The second specification is ignored for
unary operators). A specification indicates a type requirement by including one of the following
letters. If no type letter is present, any integer, character, or pointer operand will satisfy
the requirement (not float, double, or long).

b A byte (character) operand is required.

w A word (integer or pointer) operand is required.

f A float or double operand is required.

d A double operand is required.

l A long (32-bit integer) operand is required.

Before discussing the ‘degree of difficulty’ specification, the algorithm has to be explained 
more completely. Rcexpr (and cexpr) are called with a register number in which to place their 
result. Registers 0, 1, ... are used during evaluation of expressions; the maximum register 
which can be used in this way depends on the number of register variables, but in any event 
only registers 0 through 4 are available since r5 is used as a stack frame header and r6 (sp) and 
r7 (pc) have special hardware properties. The code generation routines assume that when 
called with register n as argument, they may use n+1, ... (up to the first register variable) as
temporaries. Consider the expression ‘X+Y’ where both X and Y are expressions. As a first
approximation, there are three ways of compiling code to put this expression in register n. 

1. If Y is an addressible cell, (recursively) put X into register n and add Y to it. 

2. If Y is an expression that can be calculated in k registers, where k is smaller than the
number of registers available, compile X into register n, Y into register n+1, and add
register n+1 to n.

3. Otherwise, compile Y into register n, save the result in a temporary (actually, on the
stack), compile X into register n, then add in the temporary.

The distinction between cases 2 and 3 therefore depends on whether the right operand 
can be compiled in fewer than k registers, where k is the number of free registers left after 
registers 0 through n are taken: 0 through n-1 are presumed to contain already computed temporary
results; n will, in case 2, contain the value of the left operand while the right is being
evaluated. 
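
The choice among the three cases can be sketched as a decision function. The names are hypothetical, and the boundary condition is an assumption: the sketch treats the right operand as fitting when it needs no more registers than remain above register n, following the definition of the ‘e’ specification given below.

```c
#include <assert.h>

enum strategy { ADD_DIRECT, USE_NEXT_REG, SPILL_TO_STACK };

/* Choose how to compile X+Y into register n.
   y_addressible: nonzero if Y can be an instruction operand as it stands
   y_regs:        registers needed to evaluate Y
   nreg:          number of registers usable for expressions (r0 .. nreg-1) */
enum strategy choose(int y_addressible, int y_regs, int n, int nreg)
{
    int free_regs = nreg - (n + 1);   /* registers n+1 .. nreg-1 */

    assert(n >= 0 && n < nreg);
    if (y_addressible)
        return ADD_DIRECT;            /* case 1: X into n, add Y */
    if (y_regs <= free_regs)
        return USE_NEXT_REG;          /* case 2: X into n, Y into n+1 */
    return SPILL_TO_STACK;            /* case 3: Y first, saved on the stack */
}
```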

These considerations should make clear the specification codes for the degree of difficulty, 
bearing in mind that a number of special cases are also present: 

z is satisfied when the operand is zero, so that special code can be produced for expressions
like ‘x = 0’.

1 is satisfied when the operand is the constant 1, to optimize cases like left and right shift
by 1, which can be done efficiently on the PDP-11.

c is satisfied when the operand is a positive (16-bit) constant; this takes care of some special
cases in long arithmetic.

a is satisfied when the operand is addressible; this occurs not only for variables and constants,
but also for some more complicated constructions, such as indirection through a
simple variable, ‘*p++’ where p is a register variable (because of the PDP-11’s auto-increment
address mode), and ‘*(p+c)’ where p is a register and c is a constant. Precisely,
the requirement is that the operand refers to a cell whose address can be written
as a source or destination of a PDP-11 instruction.

e is satisfied by an operand whose value can be generated in a register using no more than
k registers, where k is the number of registers left (not counting the current register).
The ‘e’ stands for ‘easy.’

n is satisfied by any operand. The ‘n’ stands for ‘anything.’ 

These degrees of difficulty are considered to lie in a linear ordering and any operand 
which satisfies an earlier-mentioned requirement will satisfy a later one. Since the subtables 
are searched linearly, if a ‘1’ specification is included, almost certainly a ‘z’ must be written
first to prevent expressions containing the constant 0 from being compiled as if the 0 were 1.

Finally, a key specification may contain a ‘*’ which requires the operand to have an
indirection as its leading operator. Examples below should clarify the utility of this 
specification. 

Now let us consider the contents of the code string associated with each subtable entry. 
Conventionally, lower-case letters in this string represent literal information which is copied 
directly to the output. Upper-case letters generally introduce specific macro-operations, some 
of which may be followed by modifying information. The code strings in the tables are written 
with tabs and new-lines used freely to suggest instructions which will be generated; the table-compiling
program compresses tabs (using the 0200 bit of the next character) and throws away
some of the new-lines. For example the macro ‘F’ is ordinarily written on a line by itself; but 
since its expansion will end with a new-line, the new-line after ‘F’ itself is dispensable. This is 
all to reduce the size of the stored tables. 

The first set of macro-operations is concerned with compiling subtrees. Recall that this 
is done by the cexpr routine. In the following discussion the ‘current register’ is generally the 
argument register to cexpr; that is, the place where the result is desired. The ‘next register’ is
numbered one higher than the current register. (This explanation isn’t fully true because of
complications, described below, involving operations which require even-odd register pairs.)

F causes a recursive call to the rcexpr routine to compile code which places the value of the 
first (left) operand of the operator in the current register. 

F1 generates code which places the value of the first operand in the next register. It is
incorrectly used if there might be no next register; that is, if the degree of difficulty of the 
first operand is not ‘easy;’ if not, another register might not be available. 

FS generates code which pushes the value of the first operand on the stack, by calling rcexpr 
specifying sptab as the table. 

Analogously, 

S, S1, SS

compile the second (right) operand into the current register, the next register, or onto the 
stack. 

To deal with registers, there are 

R which expands into the name of the current register. 

R1 which expands into the name of the next register. 

R+ which expands into the name of the current register plus 1. It was suggested above
that this is the same as the next register, except for complications; here is one of them.

Long integer variables have 32 bits and require 2 registers; in such cases the next register 
is the current register plus 2. The code would like to talk about both halves of the long 
quantity, so R refers to the register with the high-order part and R+ to the low-order 
part. 

R- This is another complication, involving division and mod. These operations involve a 
pair of registers of which the odd-numbered contains the left operand. Cexpr arranges 
that the current register is odd; the R- notation allows the code to refer to the next 
lower, even-numbered register. 

To refer to addressible quantities, there are the notations: 

A1 causes generation of the address specified by the first operand. For this to be legal, the 
operand must be addressible; its key must contain an ‘a’ or a more restrictive 
specification. 

A2 correspondingly generates the address of the second operand providing it has one. 

We now have enough mechanism to show a complete, if suboptimal, table for the +
operator on word or byte operands.

%n,z
    F

%n,1
    F
    inc R

%n,aw
    F
    add A2,R

%n,e
    F
    S1
    add R1,R

%n,n
    SS
    F
    add (sp)+,R

The first two sequences handle some special cases. Actually it turns out that handling a right 
operand of 0 is unnecessary since the expression-optimizer throws out adds of 0. Adding 1 by 
using the ‘increment’ instruction is done next, and then the case where the right operand is 
addressible. It must be a word quantity, since the PDP-11 lacks an ‘add byte’ instruction. 
Finally the cases where the right operand either can, or cannot, be done in the available registers
are treated.
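
The linear search over such a subtable can be sketched in C. Everything here is a hypothetical model: the structure names, the numeric encoding of the difficulty classes, and the omission of type letters, the ‘c’ class, and the ‘*’ flag are simplifications made for the example. The ordering of the enum mirrors the rule that an operand satisfying an earlier class also satisfies every later one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Difficulty classes in increasing order; a key accepts any operand whose
   class is no harder than the class it names. */
enum difficulty { ZERO, ONE, ADDRESSIBLE, EASY, ANYTHING };

struct subentry {
    enum difficulty left, right;  /* hardest operand each side may be */
    const char *code;             /* code string; NULL ends the subtable */
};

struct mainentry {
    int op;                       /* operator number; 0 ends the table */
    const struct subentry *sub;
};

/* The addition subtable shown above, with new-lines inside each string. */
static const struct subentry plus_sub[] = {
    { ANYTHING, ZERO,        "F" },
    { ANYTHING, ONE,         "F\ninc R" },
    { ANYTHING, ADDRESSIBLE, "F\nadd A2,R" },
    { ANYTHING, EASY,        "F\nS1\nadd R1,R" },
    { ANYTHING, ANYTHING,    "SS\nF\nadd (sp)+,R" },
    { ANYTHING, ANYTHING,    NULL }
};

static const struct mainentry regtab_sketch[] = {
    { '+', plus_sub },
    { 0, NULL }
};

/* Return the code string of the first compatible key, or NULL on failure. */
const char *match(const struct mainentry *tab, int op,
                  enum difficulty l, enum difficulty r)
{
    for (; tab->op != 0; tab++) {
        if (tab->op != op)
            continue;
        for (const struct subentry *e = tab->sub; e->code != NULL; e++)
            if (l <= e->left && r <= e->right)
                return e->code;   /* first compatible key wins */
        return NULL;              /* operator found, but no key fits */
    }
    return NULL;                  /* operator not in this table */
}
```

Because the first compatible key wins, a zero right operand selects the ‘%n,z’ entry only if that entry precedes ‘%n,1’; with the order reversed, the zero would be compiled with ‘inc’.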

The next macro-instructions are conveniently introduced by noticing that the above table 
is suitable for subtraction as well as addition, since no use is made of the commutativity of 
addition. All that is needed is substitution of ‘sub’ for ‘add’ and ‘dec’ for ‘inc.’ Considerable
saving of space is achieved by factoring out several similar operations. 

I is replaced by a string from another table indexed by the operator in the node being 
expanded. This secondary table actually contains two strings per operator. 

I' is replaced by the second string in the side table entry for the current operator. 

Thus, given that the entries for ‘+’ and ‘-’ in the side table (which is called instab) are
‘add’ and ‘inc,’ ‘sub’ and ‘dec’ respectively, the middle of the above addition table can be
written

%n,1
    F
    I' R

%n,aw
    F
    I A2,R

and it will be suitable for subtraction, and several other operators, as well. 
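
The substitution performed for I and I' can be sketched as follows. The table contents come from the text; the structure, the function name expand, and the treatment of every ‘I’ in the string as a macro are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct instab_entry {
    int op;
    const char *first;   /* replaces I  */
    const char *second;  /* replaces I' */
};

static const struct instab_entry instab_sketch[] = {
    { '+', "add", "inc" },
    { '-', "sub", "dec" },
    { 0, NULL, NULL }
};

/* Expand I and I' in code for operator op; all other characters are copied.
   Returns 0 on success, -1 if op is unknown or out is too small. */
int expand(const char *code, int op, char *out, size_t outsize)
{
    const struct instab_entry *e = instab_sketch;
    size_t used = 0;

    if (outsize == 0)
        return -1;
    while (e->op != 0 && e->op != op)
        e++;
    if (e->op == 0)
        return -1;

    for (; *code != '\0'; code++) {
        char one[2] = { *code, '\0' };
        const char *piece = one;
        size_t len;

        if (*code == 'I') {
            if (code[1] == '\'') {
                piece = e->second;
                code++;             /* skip the quote mark as well */
            } else {
                piece = e->first;
            }
        }
        len = strlen(piece);
        if (used + len + 1 > outsize)
            return -1;
        memcpy(out + used, piece, len);
        used += len;
    }
    out[used] = '\0';
    return 0;
}
```

With this table, the string ‘I A2,R’ expands to ‘add A2,R’ for addition and ‘sub A2,R’ for subtraction, so one subtable serves both operators.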

Next, there is the question of character and floating-point operations. 

B1 generates the letter ‘b’ if the first operand is a character, ‘f’ if it is float or double, and
nothing otherwise. It is used in a context like ‘movB1’ which generates a ‘mov’, ‘movb’,
or ‘movf’ instruction according to the type of the operand.

B2 is just like B1 but applies to the second operand.

BE generates ‘b’ if either operand is a character and null otherwise.

BF generates ‘f’ if the type of the operator node itself is float or double, otherwise null.

For example, there is an entry in efftab for the ‘=’ operator

%a,aw
%ab,a
    IBE A2,A1


Note first that two key specifications can be applied to the same code string. Next, observe
that when a word is assigned to a byte or to a word, or a byte is assigned to a byte, a single
instruction, a mov or movb as appropriate, does the job. However, when a byte is assigned to a
word, it must pass through a register to implement the sign-extension rules:


%a,n
    S
    IB1 R,A1


Next, there is the question of handling indirection properly. Consider the expression ‘X
+ *Y’ where X and Y are expressions. Assuming that Y is more complicated than just a variable,
but on the other hand qualifies as ‘easy’ in the context, the expression would be compiled
by placing the value of X in a register, that of *Y in the next register, and adding the registers.
It is easy to see that a better job can be done by compiling X, then Y (into the next register),
and producing the instruction symbolized by ‘add (R1),R’. This scheme avoids generating the
instruction ‘mov (R1),R1’ required actually to place the value of *Y in a register. A related
situation occurs with the expression ‘X + *(p+6)’, which exemplifies a construction frequent in
structure and array references. The addition table shown above would produce

[put X in register R] 
mov p,R1
add $6,R1 
mov (R1),R1 
add R1,R 

when the best code is 

[put X in R] 
mov p,R1
add 6(R1),R 

As we said above, a key specification for a code table entry may require an operand to have an 
indirection as its highest operator. To make use of the requirement, the following macros are 
provided. 

F* the first operand must have the form *X. If in particular it has the form *(Y + c), for
some constant c, then code is produced which places the value of Y in the current register.
Otherwise, code is produced which loads X into the current register.

F1* resembles F* except that the next register is loaded.

S* resembles F* except that the second operand is loaded.

S1* resembles S* except that the next register is loaded.

FS* The first operand must have the form ‘*X’. Push the value of X on the stack.

SS* resembles FS* except that it applies to the second operand. 

To capture the constant that may have been skipped over in the above macros, there are 

#1 The first operand must have the form *X; if in particular it has the form *(Y + c) for c a 
constant, then the constant is written out, otherwise a null string. 

#2 is the same as #1 except that the second operand is used.

Now we can improve the addition table above. Just before the ‘%n,e’ entry, put

%n,ew*
    F
    S1*
    add #2(R1),R

and just before the ‘%n,n’ put

%n,nw*
    SS*
    F
    add *(sp)+,R

When using the stacking macros there is no place to use the constant as an index word, so that 
particular special case doesn’t occur. 

The constant mentioned above can actually be more general than a number. Any quantity
acceptable to the assembler as an expression will do, in particular the address of a static
cell, perhaps with a numeric offset. If x is an external character array, the expression
‘x[i+5] = 0’ will generate the code

mov i,r0
clrb x+5(r0)

via the table entry (in the ‘=’ part of efftab)

%e*,z
    F*
    I'B1 #1(R)

Some machine operations place restrictions on the registers used. The divide instruction, used 
to implement the divide and mod operations, requires the dividend to be placed in the odd 
member of an even-odd pair; other peculiarities of multiplication make it simplest to put the 
multiplicand in an odd-numbered register. There is no theory which optimally accounts for 
this kind of requirement. Cexpr handles it by checking for a multiply, divide, or mod operation;
in these cases, its argument register number is incremented by one or two so that it is
odd, and if the operation was divide or mod, so that it is a member of a free even-odd pair.
The routine which determines the number of registers required estimates, conservatively, that
at least two registers are required for a multiplication and three for the other peculiar operators.
After the expression is compiled, the register where the result actually ended up is
returned. (Divide and mod are actually the same operation except for the location of the 
result). 

These operations are the ones which cause results to end up in unexpected places, and 
this possibility adds a further level of complexity. The simplest way of handling the problem is 
always to move the result to the place where the caller expected it, but this will produce 
unnecessary register moves in many simple cases; ‘a = b*c’ would generate 

mov b,r1
mul c,r1
mov r1,r0
mov r0,a

The next thought is to use the passed-back information as to where the result landed to change
the notion of the current register. While compiling the ‘=’ operation above, which comes from
a table entry like

%a,e
    S
    mov R,A1

it is sufficient to redefine the meaning of ‘R’ after processing the ‘S’ which does the multiply.
This technique is in fact used; the tables are written in such a way that correct code is produced.
The trouble is that the technique cannot be used in general, because it invalidates the
count of the number of registers required for an expression. Consider just ‘a*b + X’ where X
is some expression. The algorithm assumes that the value of a*b, once computed, requires just
one register. If there are three registers available, and X requires two registers to compute,
then this expression will match a key specifying ‘%n,e’. If a*b is computed and left in register
1, then there are, contrary to expectations, no longer two registers available to compute X, but 
only one, and bad code will be produced. To guard against this possibility, cexpr checks the 
result returned by recursive calls which implement F, S and their relatives. If the result is not 
in the expected register, then the number of registers required by the other operand is checked; 
if it can be done using those registers which remain even after making unavailable the 
unexpectedly-occupied register, then the notions of the ‘next register’ and possibly the ‘current 
register’ are redefined. Otherwise a register-copy instruction is produced. A register-copy is 
also always produced when the current operator is one of those which have odd-even require¬ 
ments. 

Finally, there are a few loose-end macro operations and facts about the tables. The 
operators: 

V is used for long operations. It is written with an address like a machine instruction; it 
expands into ‘adc’ (add carry) if the operation is an additive operator, ‘sbc’ (subtract 
carry) if the operation is a subtractive operator, and disappears, along with the rest of the 
line, otherwise. Its purpose is to allow common treatment of logical operations, which 
have no carries, and additive and subtractive operations, which generate carries. 

T generates a ‘tst’ instruction if the first operand of the tree does not set the condition 
codes correctly. It is used with divide and mod operations, which require a sign-extended 
32-bit operand. The code table for the operations contains an ‘sxt’ (sign-extend) instruc¬ 
tion to generate the high-order part of the dividend. 

H is analogous to the ‘F’ and ‘S’ macros, except that it calls for the generation of code for 
the current tree (not one of its operands) using regtab. It is used in cctab for all the 
operators which, when executed normally, set the condition codes properly according to 
the result. It prevents a ‘tst’ instruction from being generated for constructions like ‘if 
(a+b) ...’ since after calculation of the value of ‘a+b’ a conditional branch can be written 
immediately. 

All of the discussion above is in terms of operators with operands. Leaves of the expres¬ 
sion tree (variables and constants), however, are peculiar in that they have no operands. In 
order to regularize the matching process, cexpr examines its operand to determine if it is a leaf; 
if so, it creates a special ‘load’ operator whose operand is the leaf, and substitutes it for the 
argument tree; this allows the table entry for the created operator to use the ‘A1’ notation to 
load the leaf into a register. 

Purely to save space in the tables, pieces of subtables can be labelled and referred to later. 
It turns out, for example, that rather large portions of the efftab table for the ‘=’ and ‘=+’ 
operators are identical. Thus ‘=’ has an entry 

%[move3:] 
%a,aw 
%ab,a 
IBE A2,A1 

while part of the ‘=+’ table is 

%aw,aw 
%[move3] 

Labels are written as ‘%[ ... : ]’, before the key specifications; references are written with 
‘%[ ... ]’ after the key. Peculiarities in the implementation make it necessary that labels appear 
before references to them. 

The example illustrates the utility of allowing separate keys to point to the same code 
string. The assignment code works properly if either the right operand is a word, or the left 
operand is a byte; but since there is no ‘add byte’ instruction the addition code has to be res¬ 
tricted to word operands. 

Delaying and reordering 

Intertwined with the code generation routines are two other, interrelated processes. The 
first, implemented by a routine called delay , is based on the observation that naive code gen¬ 
eration for the expression ‘a = b++’ would produce 

mov b,r0 
inc b 
mov r0,a 

The point is that the table for postfix ++ has to preserve the value of b before incrementing it; 
the general way to do this is to preserve its value in a register. A cleverer scheme would gen¬ 
erate 

mov b,a 
inc b 

Delay is called for each expression input to rcexpr, and it searches for postfix ++ and -- 
operators. If one is found applied to a variable, the tree is patched to bypass the operator and 
compiled as it stands; then the increment or decrement itself is done. The effect is as if ‘a = 
b; b++’ had been written. In this example, of course, the user himself could have done the 
same job, but more complicated examples are easily constructed, for example ‘switch (x++)’. 
An essential restriction is that the condition codes not be required. It would be incorrect to 
compile ‘if (a++) ...’ as 

tst a 
inc a 
beq ... 

because the ‘inc’ destroys the required setting of the condition codes. 
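
The tree patching done by delay can be sketched in Python on a toy tuple-based tree. The tree shapes and the `pending` list are assumptions made for this illustration; they are not the compiler's actual data structures, and the condition-code restriction described above is not modelled.

```python
# Toy trees: ("name", "b"), ("const", 1), ("post++", operand), ("=", lhs, rhs).
# These shapes are illustrative assumptions only.

def delay(tree, pending):
    """Return a copy of tree with postfix ++/-- on plain names bypassed.

    Each bypassed operator is appended to `pending`, so the caller can
    compile the increment or decrement after the main expression.
    """
    op = tree[0]
    if op in ("post++", "post--") and tree[1][0] == "name":
        pending.append(tree)
        return tree[1]                  # bypass the operator
    if op in ("name", "const"):
        return tree                     # leaves are left alone
    return (op,) + tuple(delay(t, pending) for t in tree[1:])

pending = []
patched = delay(("=", ("name", "a"), ("post++", ("name", "b"))), pending)
print(patched)   # tree for 'a = b'
print(pending)   # the deferred 'b++'
```

Compiling `patched` first and then each entry of `pending` gives the effect of ‘a = b; b++’.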

Reordering is a similar sort of optimization. Many cases which it detects are useful 
mainly with register variables. If r is a register variable, the expression ‘r = x+y’ is best com¬ 
piled as 

mov x,r 
add y,r 

but the code tables would produce 

mov x,r0 
add y,r0 
mov r0,r 

which is in fact preferred if r is not a register. (If r is not a register, the two sequences are the 
same size, but the second is slightly faster.) The scheme is to compile the expression as if it 
had been written ‘r = x; r =+ y’. The reorder routine is called with a pointer to each tree 
that rcexpr is about to compile; if it has the right characteristics, the ‘r = x’ tree is constructed 
and passed recursively to rcexpr; then the original tree is modified to read ‘r =+ y’ and the 
calling instance of rcexpr compiles that instead. Of course the whole business is itself recursive 
so that more extended forms of the same phenomenon are handled, like ‘r = x + y | z’. 

Care does have to be taken to avoid ‘optimizing’ an expression like ‘r = x + r’ into ‘r = 
x; r =+ r’. It is required that the right operand of the expression on the right of the ‘=’ be a 
’, distinct from the register variable. 
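
The first reorder case can be sketched in Python as follows. The tuple trees, the `registers` set, and the rule that the right operand must be a plain name other than the register are assumptions chosen for this illustration (only binary operators are handled); the real routine works on the compiler's own tree nodes.

```python
def reorder(tree, registers):
    """Split 'r = x op y' into ['r = x', 'r =op y'] when r is a register.

    Trees are tuples such as ("=", "r", ("+", "x", "y")); names are plain
    strings. Returns the list of trees to compile, in order.
    """
    if tree[0] != "=" or tree[1] not in registers:
        return [tree]
    r, rhs = tree[1], tree[2]
    if isinstance(rhs, str):              # already of the form 'r = x'
        return [tree]
    op, left, right = rhs
    if not isinstance(right, str) or right == r:
        return [tree]                     # e.g. 'r = x + r' must stay whole
    # Recurse so that 'r = x + y | z' becomes three steps.
    return reorder(("=", r, left), registers) + [("=" + op, r, right)]

steps = reorder(("=", "r", ("|", ("+", "x", "y"), "z")), {"r"})
print(steps)
```

The recursion on the left operand is what handles the more extended forms; an expression whose right operand is the register itself is returned unchanged.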

The second case that reorder handles is expressions of the form ‘r = X’ used as a 
subexpression. Again, the code out of the tables for ‘x = r = y’ would be 

mov y,r0 
mov r0,r 
mov r0,x 

whereas if r were a register it would be better to produce 

mov y,r 
mov r,x 

When reorder discovers that a register variable is being assigned to in a subexpression, it calls 
rcexpr recursively to compile the subexpression, then fiddles the tree passed to it so that the 
register variable itself appears as the operand instead of the whole subexpression. Here care 
has to be taken to avoid an infinite regress, with rcexpr and reorder calling each other forever 
to handle assignments to registers. 

A third set of cases treated by reorder comes up when any name, not necessarily a register, 
occurs as a left operand of an assignment operator other than ‘=’ or as an operand of 
prefix ‘++’ or ‘--’. Unless condition-code tests are involved, when a subexpression like ‘(a =+ 
b)’ is seen, the assignment is performed and the argument tree modified so that a is its 
operand; effectively ‘x + (y =+ z)’ is compiled as ‘y =+ z; x + y’. Similarly, prefix increment 
and decrement are pulled out and performed first, then the remainder of the expression. 

Throughout code generation, the expression optimizer is called whenever delay or reorder 
change the expression tree. This allows some special cases to be found that otherwise would 
not be seen. 

The M4 Macro Processor 


Brian W. Kernighan 

Dennis M. Ritchie 

Bell Laboratories 
Murray Hill, New Jersey 07974 

ABSTRACT 

M4 is a macro processor available on UNIX† and GCOS. Its primary use 
has been as a front end for Ratfor for those cases where parameterless macros 
are not adequately powerful. It has also been used for languages as disparate 
as C and Cobol. M4 is particularly suited for functional languages like Fortran, 
PL/I and C since macros are specified in a functional notation. 

M4 provides features seldom found even in much larger macro processors, 
including 

• arguments 

• condition testing 

• arithmetic capabilities 

• string and substring functions 

• file manipulation 

This paper is a user’s manual for M4. 

July 1, 1977 


† UNIX is a trademark of Bell Laboratories. 

Introduction 

A macro processor is a useful way to 
enhance a programming language, to make 
it more palatable or more readable, or to 
tailor it to a particular application. The 
#define statement in C and the analogous 
define in Ratfor are examples of the basic 
facility provided by any macro processor — 
replacement of text by other text. 

The M4 macro processor is an exten¬ 
sion of a macro processor called M3 which 
was written by D. M. Ritchie for the AP-3 
minicomputer; M3 was in turn based on a 
macro processor implemented for [1]. 
Readers unfamiliar with the basic ideas of 
macro processing may wish to read some of 
the discussion there. 

M4 is a suitable front end for Ratfor 
and C, and has also been used successfully 
with Cobol. Besides the straightforward 
replacement of one string of text by 
another, it provides macros with arguments, 
conditional macro expansion, arithmetic, file 
manipulation, and some specialized string 
processing functions. 

The basic operation of M4 is to copy 
its input to its output. As the input is read, 
however, each alphanumeric “token” (that 
is, string of letters and digits) is checked. If 
it is the name of a macro, then the name of 
the macro is replaced by its defining text, 
and the resulting string is pushed back onto 
the input to be rescanned. Macros may be 
called with arguments, in which case the 
arguments are collected and substituted into 
the right places in the defining text before it 
is rescanned. 
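
The copy-and-rescan loop just described can be sketched in a few lines of Python. This is a minimal illustration, not M4's implementation: the function name `expand` and the dictionary of macros are assumptions, and arguments, quoting and built-ins are deliberately left out.

```python
import re

def expand(text, macros):
    """Copy text to the output, replacing tokens that name macros.

    The defining text is pushed back onto the front of the input and
    rescanned, so a macro whose body names another macro is expanded
    in turn.
    """
    out = []
    while text:
        m = re.match(r"[A-Za-z_][A-Za-z0-9_]*", text)
        if m:
            token, text = m.group(), text[m.end():]
            if token in macros:
                text = macros[token] + text   # push back for rescanning
            else:
                out.append(token)
        else:
            out.append(text[0])               # copy other characters as-is
            text = text[1:]
    return "".join(out)

result = expand("if (i > N) NNN = M", {"M": "N", "N": "100"})
print(result)
```

Because whole tokens are compared, NNN is left alone even though it contains the letter N.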

M4 provides a collection of about 
twenty built-in macros which perform vari¬ 
ous useful operations; in addition, the user 
can define new macros. Built-ins and user-defined 
macros work exactly the same way, 
except that some of the built-in macros 
have side effects on the state of the process. 

Usage 

On UNIX, use 

m4 [files] 

Each argument file is processed in order; if 
there are no arguments, or if an argument is 
‘-’, the standard input is read at that point. 
The processed text is written on the stan¬ 
dard output, which may be captured for 
subsequent processing with 

m4 [files] >outputfile 

On GCOS, usage is identical, but the program 
is called ./m4. 

Defining Macros 

The primary built-in function of M4 is 
define, which is used to define new macros. 
The input 

define(name, stuff) 

causes the string name to be defined as 
stuff. All subsequent occurrences of name 
will be replaced by stuff. name must be 
alphanumeric and must begin with a letter 
(the underscore _ counts as a letter). stuff 
is any text that contains balanced 
parentheses; it may stretch over multiple 
lines. 

Thus, as a typical example, 

define(N, 100) 

if (i > N) 

defines N to be 100, and uses this “symbolic 
constant” in a later if statement. 

The left parenthesis must immediately 
follow the word define, to signal that 
define has arguments. If a macro or built-in 
name is not followed immediately by ‘(’, 
it is assumed to have no arguments. This is 
the situation for N above; it is actually a 
macro with no arguments, and thus when it 
is used there need be no (...) following it. 

You should also notice that a macro 
name is only recognized as such if it 
appears surrounded by non-alphanumerics. 
For example, in 

define(N, 100) 
if (NNN > 100) 

the variable NNN is absolutely unrelated to 
the defined macro N, even though it con¬ 
tains a lot of N’s. 

Things may be defined in terms of 
other things. For example, 

define(N, 100) 
define(M, N) 

defines both M and N to be 100. 

What happens if N is redefined? Or, 
to say it another way, is M defined as N or 
as 100? In M4, the latter is true — M is 
100, so even if N subsequently changes, M 
does not. 

This behavior arises because M4 
expands macro names into their defining 
text as soon as it possibly can. Here, that 
means that when the string N is seen as the 
arguments of define are being collected, it 
is immediately replaced by 100; it’s just as if 
you had said 

define(M, 100) 

in the first place. 

If this isn’t what you really want, 
there are two ways out of it. The first, 
which is specific to this situation, is to 
interchange the order of the definitions: 

define(M, N) 
define(N, 100) 

Now M is defined to be the string N, so 
when you ask for M later, you’ll always get 
the value of N at that time (because the M 
will be replaced by N which will be replaced 
by 100). 
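
The difference between the two orders of definition can be modelled in Python. This is only a sketch: the `expand` helper and the dictionaries stand in for M4's macro table and are assumptions of the example, and the early expansion of define's arguments is imitated by calling `expand` before storing the value.

```python
import re

def expand(text, macros):
    """Replace macro names by their definitions, rescanning the result."""
    def repl(m):
        name = m.group()
        return expand(macros[name], macros) if name in macros else name
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", repl, text)

# define(N, 100) followed by define(M, N): N is expanded while the
# arguments of define are collected, so M is stored as 100.
early = {"N": "100"}
early["M"] = expand("N", early)
early["N"] = "200"            # a later change to N

# define(M, N) before N exists: M is stored as the string N.
late = {}
late["M"] = expand("N", late)
late["N"] = "100"
late["N"] = "200"             # the same later change to N

print(expand("M", early))     # M kept the old value
print(expand("M", late))      # M follows N
```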


Quoting 

The more general solution is to delay 
the expansion of the arguments of define 
by quoting them. Any text surrounded by 
the single quotes ` and ' is not expanded 
immediately, but has the quotes stripped 
off. If you say 

define(N, 100) 
define(M, `N') 

the quotes around the N are stripped off as 
the argument is being collected, but they 
have served their purpose, and M is defined 
as the string N, not 100. The general rule is 
that M4 always strips off one level of single 
quotes whenever it evaluates something. 
This is true even outside of macros. If you 
want the word define to appear in the output, 
you have to quote it in the input, as in 

`define' = 1; 

As another instance of the same thing, 
which is a bit more surprising, consider 
redefining N: 

define(N, 100) 
define(N, 200) 

Perhaps regrettably, the N in the second 
definition is evaluated as soon as it’s seen; 
that is, it is replaced by 100, so it’s as if you 
had written 

define( 100, 200) 

This statement is ignored by M4, since you 
can only define things that look like names, 
but it obviously doesn’t have the effect you 
wanted. To really redefine N, you must 
delay the evaluation by quoting: 

define(N, 100) 
define(`N', 200) 

In M4, it is often wise to quote the first 
argument of a macro. 

If ` and ' are not convenient for some 
reason, the quote characters can be changed 
with the built-in changequote: 

changequote([, ]) 

makes the new quote characters the left and 
right brackets. You can restore the original 
characters with just 

changequote 

There are two additional built-ins 
related to define. undefine removes the 
definition of some macro or built-in: 

undefine(`N') 

removes the definition of N. (Why are the 
quotes absolutely necessary?) Built-ins can 
be removed with undefine, as in 

undefine(`define') 

but once you remove one, you can never get 
it back. 

The built-in ifdef provides a way to 
determine if a macro is currently defined. 
In particular, M4 has pre-defined the names 
unix and gcos on the corresponding systems, 
so you can tell which one you’re using: 

ifdef(`unix', `define(wordsize,16)' ) 
ifdef(`gcos', `define(wordsize,36)' ) 

makes a definition appropriate for the particular 
machine. Don’t forget the quotes! 

ifdef actually permits three arguments; 
if the name is undefined, the value of 
ifdef is then the third argument, as in 

ifdef(`unix', on UNIX, not on UNIX) 

Arguments 

So far we have discussed the simplest 
form of macro processing — replacing one 
string by another (fixed) string. User- 
defined macros may also have arguments, so 
different invocations can have different 
results. Within the replacement text for a 
macro (the second argument of its define) 
any occurrence of $n will be replaced by the 
nth argument when the macro is actually 
used. Thus, the macro bump, defined as 

define(bump, $1 = $1 + 1) 

generates code to increment its argument by 
1: 

bump(x) 

is 

x = x + 1 

A macro can have as many arguments 
as you want, but only the first nine are 
accessible, through $1 to $9. (The macro 
name itself is $0, although that is less commonly 
used.) Arguments that are not supplied 
are replaced by null strings, so we can 
define a macro cat which simply concatenates 
its arguments, like this: 

define(cat, $1$2$3$4$5$6$7$8$9) 

Thus 

cat(x, y, z) 

is equivalent to 

xyz 

$4 through $9 are null, since no 
corresponding arguments were provided. 
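
Argument substitution can be mimicked in Python with a regular expression. The helper name `substitute` is hypothetical; the sketch covers only the replacement of $0 through $9 and assumes the arguments have already been collected.

```python
import re

def substitute(name, body, args):
    """Replace $0..$9 in body; arguments not supplied become empty strings."""
    def repl(m):
        n = int(m.group(1))
        if n == 0:
            return name                      # $0 is the macro name itself
        return args[n - 1] if n <= len(args) else ""
    return re.sub(r"\$([0-9])", repl, body)

bumped = substitute("bump", "$1 = $1 + 1", ["x"])
joined = substitute("cat", "$1$2$3$4$5$6$7$8$9", ["x", "y", "z"])
print(bumped)
print(joined)
```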

Leading unquoted blanks, tabs, or 
newlines that occur during argument collec¬ 
tion are discarded. All other white space is 
retained. Thus 

define(a, b c) 

defines a to be b c. 

Arguments are separated by commas, 
but parentheses are counted properly, so a 
comma “protected” by parentheses does not 
terminate an argument. That is, in 

define(a, (b,c)) 

there are only two arguments; the second is 
literally (b,c). And of course a bare comma 
or parenthesis can be inserted by quoting it. 

Arithmetic Built-ins 

M4 provides two built-in functions for 
doing arithmetic on integers (only). The 
simplest is incr, which increments its 
numeric argument by 1. Thus to handle the 
common programming situation where you 
want a variable to be defined as “one more 
than N”, write 

define(N, 100) 
define(N1, `incr(N)') 

Then N1 is defined as one more than the 
current value of N. 

The more general mechanism for 
arithmetic is a built-in called eval, which is 
capable of arbitrary arithmetic on integers. 

It provides the operators (in decreasing 
order of precedence) 

unary + and - 
** or ^ (exponentiation) 
* / % (modulus) 
+ - 
== != < <= > >= 
! (not) 
& or && (logical and) 
| or || (logical or) 

Parentheses may be used to group opera¬ 
tions where needed. All the operands of an 
expression given to eval must ultimately be 
numeric. The numeric value of a true rela¬ 
tion (like 1>0) is 1, and false is 0. The pre¬ 
cision in eval is 32 bits on UNIX and 36 
bits on GCOS. 

As a simple example, suppose we want 
M to be 2**N+1. Then 

define(N, 3) 

define(M, `eval(2**N+1)') 

As a matter of principle, it is advisable to 
quote the defining text for a macro unless it 
is very simple indeed (say just a number); it 
usually gives the result you want, and is a 
good habit to get into. 

File Manipulation 

You can include a new file in the input 
at any time by the built-in function 
include: 

include(filename) 

inserts the contents of filename in place of 
the include command. The contents of the 
file is often a set of definitions. The value 
of include (that is, its replacement text) is 
the contents of the file; this can be captured 
in definitions, etc. 

It is a fatal error if the file named in 
include cannot be accessed. To get some 
control over this situation, the alternate 
form sinclude can be used; sinclude 
(“silent include”) says nothing and continues 
if it can’t access the file. 

It is also possible to divert the output 
of M4 to temporary files during processing, 
and output the collected material upon com¬ 
mand. M4 maintains nine of these diver¬ 
sions, numbered 1 through 9. If you say 

divert(n) 

all subsequent output is put onto the end of 
a temporary file referred to as n. Diverting 
to this file is stopped by another divert 
command; in particular, divert or 
divert(0) resumes the normal output process. 

Diverted text is normally output all at 
once at the end of processing, with the 
diversions output in numeric order. It is 
possible, however, to bring back diversions 
at any time, that is, to append them to the 
current diversion. 

undivert 

brings back all diversions in numeric order, 
and undivert with arguments brings back 
the selected diversions in the order given. 
The act of undiverting discards the diverted 
stuff, as does diverting into a diversion 
whose number is not between 0 and 9 
inclusive. 

The value of undivert is not the 
diverted stuff. Furthermore, the diverted 
material is not rescanned for macros. 

The built-in divnum returns the 
number of the currently active diversion. 
This is zero during normal processing. 
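
The behavior of diversions can be modelled with a small Python class. This is a toy model under stated assumptions: the class and method names are invented for the example, lists stand in for the temporary files, and an attempt to undivert the currently active diversion is simply skipped.

```python
class Diversions:
    """Toy model of M4 output diversions; 0 is the normal output."""

    def __init__(self):
        self.divnum = 0                          # currently active diversion
        self.output = []                         # normal output
        self.held = {n: [] for n in range(1, 10)}

    def divert(self, n=0):
        self.divnum = n

    def write(self, text):
        if self.divnum == 0:
            self.output.append(text)
        elif self.divnum in self.held:
            self.held[self.divnum].append(text)
        # any other diversion number: the text is discarded

    def undivert(self, *numbers):
        # With no arguments, bring back every diversion in numeric order.
        for n in numbers or range(1, 10):
            if n in self.held and n != self.divnum:
                pieces, self.held[n] = self.held[n], []   # undiverting empties it
                for piece in pieces:
                    self.write(piece)

    def finish(self):
        # At the end of processing all diversions are output in order.
        self.divert(0)
        self.undivert()
        return "".join(self.output)

d = Diversions()
d.write("a\n")
d.divert(2); d.write("two\n")
d.divert(1); d.write("one\n")
d.divert(-1); d.write("lost\n")
d.divert(); d.write("b\n")
result = d.finish()
print(result, end="")
```

Text written while diversion -1 is active never appears, which is the basis of the trick for suppressing unwanted newlines from a run of definitions.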

System Command 

You can run any program in the local 
operating system with the syscmd built-in. 
For example, 

syscmd(date) 

on UNIX runs the date command. Nor¬ 
mally syscmd would be used to create a file 
for a subsequent include. 

To facilitate making unique file names, 
the built-in maketemp is provided, with 
specifications identical to the system func¬ 
tion mktemp: a string of XXXXX in the 
argument is replaced by the process id of 
the current process. 

Conditionals 

There is a built-in called ifelse which 
enables you to perform arbitrary conditional 
testing. In the simplest form, 

ifelse(a, b, c, d) 

compares the two strings a and b. If these 
are identical, ifelse returns the string c; 
otherwise it returns d. Thus we might 
define a macro called compare which compares 
two strings and returns “yes” or “no” 
if they are the same or different. 

define(compare, `ifelse($1, $2, yes, no)') 

Note the quotes, which prevent too-early 
evaluation of ifelse. 

If the fourth argument is missing, it is 
treated as empty. 

ifelse can actually have any number 
of arguments, and thus provides a limited 
form of multi-way decision capability. In 
the input 

ifelse(a, b, c, d, e, f, g) 

if the string a matches the string b, the 
result is c. Otherwise, if d is the same as e, 
the result is f. Otherwise the result is g. If 
the final argument is omitted, the result is 
null, so 

ifelse(a, b, c) 

is c if a matches b, and null otherwise. 
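
The multi-way decision made by ifelse can be written out as a short Python function. This is a sketch of the rule stated above, with Python strings standing in for macro arguments; how real M4 treats argument counts not covered by the text is not modelled.

```python
def ifelse(*args):
    """Compare args in pairs; return the value paired with the first match."""
    args = list(args)
    while len(args) >= 3:
        a, b, c = args[:3]
        if a == b:
            return c
        rest = args[3:]
        if len(rest) <= 1:
            # A single leftover argument is the default; none means null.
            return rest[0] if rest else ""
        args = rest                      # try the next pair
    return ""

print(ifelse("a", "b", "c", "d"))                  # no match: default d
print(ifelse("a", "b", "c", "d", "d", "f", "g"))   # second pair matches: f
```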

String Manipulation 

The built-in len returns the length of 
the string that makes up its argument. 
Thus 

len(abcdef) 

is 6, and len((a,b)) is 5. 

The built-in substr can be used to 
produce substrings of strings. 
substr(s, i, n) returns the substring of s 
that starts at the ith position (origin zero), 
and is n characters long. If n is omitted, 
the rest of the string is returned, so 

substr(`now is the time', 1) 

is 

ow is the time 

If i or n are out of range, various sensible 
things happen. 

index(s1, s2) returns the index (position) 
in s1 where the string s2 occurs, or -1 
if it doesn’t occur. As with substr, the ori¬ 
gin for strings is 0. 
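
For readers who know Python, len, substr and index correspond closely to slicing and `str.find`. The wrappers below are hypothetical names for this comparison; they reproduce the in-range behavior described above, and out-of-range values follow Python's slicing rules, which need not match M4's.

```python
def substr(s, i, n=None):
    """Substring of s starting at position i (origin zero), n characters long."""
    return s[i:] if n is None else s[i:i + n]

def index(s1, s2):
    """Position in s1 where s2 first occurs, or -1 if it does not occur."""
    return s1.find(s2)

print(len("abcdef"), len("(a,b)"))
print(substr("now is the time", 1))
print(index("now is the time", "is"))
```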

The built-in translit performs charac¬ 
ter transliteration. 

translit(s, f, t) 

modifies s by replacing any character found 
in f by the corresponding character of t. 

That is, 


translit(s, aeiou, 12345) 

replaces the vowels by the corresponding 
digits. If t is shorter than f, characters 
which don’t have an entry in t are deleted; 
as a limiting case, if t is not present at all, 
characters from f are deleted from s. So 

translit(s, aeiou) 

deletes vowels from s. 
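
The transliteration rule, including deletion when t is shorter than f, can be sketched in Python. The function below is an illustration only; where a character appears more than once in f, it assumes the first occurrence wins.

```python
def translit(s, f, t=""):
    """Replace each character of s found in f by the matching character of t.

    Characters of f with no counterpart in t are deleted from s.
    """
    table = {}
    for i, ch in enumerate(f):
        if ch not in table:
            table[ch] = t[i] if i < len(t) else ""
    return "".join(table.get(ch, ch) for ch in s)

print(translit("education", "aeiou", "12345"))
print(translit("education", "aeiou"))
```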

There is also a built-in called dnl 
which deletes all characters that follow it up 
to and including the next newline; it is use¬ 
ful mainly for throwing away empty lines 
that otherwise tend to clutter up M4 output. 
For example, if you say 

define(N, 100) 
define(M, 200) 
define(L, 300) 

the newline at the end of each line is not 
part of the definition, so it is copied into 
the output, where it may not be wanted. If 
you add dnl to each of these lines, the new¬ 
lines will disappear. 

Another way to achieve this, due to J. 
E. Weythman, is 

divert(-1) 

define(...) 

divert 

Printing 

The built-in errprint writes its arguments 
out on the standard error file. Thus 
you can say 

errprint(`fatal error') 

dumpdef is a debugging aid which 
dumps the current definitions of defined 
terms. If there are no arguments, you get 
everything; otherwise you get the ones you 
name as arguments. Don’t forget to quote 
the names! 

Summary of Built-ins 

Each entry is preceded by the page 
number where it is described. 

3 changequote(L, R) 
1 define(name, replacement) 
4 divert(number) 
4 divnum 
5 dnl 
5 dumpdef(`name', `name', ...) 
5 errprint(s, s, ...) 
4 eval(numeric expression) 
3 ifdef(`name', this if true, this if false) 
5 ifelse(a, b, c, d) 
4 include(file) 
3 incr(number) 
5 index(s1, s2) 
5 len(string) 
4 maketemp(...XXXXX...) 
4 sinclude(file) 
5 substr(string, position, number) 
4 syscmd(s) 
5 translit(str, from, to) 
3 undefine(`name') 
4 undivert(number,number,...) 

Awk — A Pattern Scanning and Processing Language 

(Second Edition) 

Alfred V. Aho 
Brian W. Kernighan 
Peter J. Weinberger 


ABSTRACT 

Awk is a programming language whose basic operation is to search a set 
of files for patterns, and to perform specified actions upon lines or fields of 
lines which contain instances of those patterns. Awk makes certain data selec¬ 
tion and transformation operations easy to express; for example, the awk pro¬ 
gram 

length > 72 

prints all input lines whose length exceeds 72 characters; the program 

NF % 2 == 0 

prints all lines with an even number of fields; and the program 

{ $1 = log($1); print } 

replaces the first field of each line by its logarithm. 

Awk patterns may include arbitrary boolean combinations of regular 
expressions and of relational operators on strings, numbers, fields, variables, 
and array elements. Actions may include the same pattern-matching construc¬ 
tions as in patterns, as well as arithmetic and string expressions and assign¬ 
ments, if-else, while, for statements, and multiple output streams. 

This report contains a user’s guide, a discussion of the design and implementation 
of awk, and some timing statistics. 


September 1, 1978 

Awk — A Pattern Scanning and Processing Language 

(Second Edition) 

Alfred V. Aho 
Brian W. Kernighan 

Peter J. Weinberger 

Bell Laboratories 
Murray Hill, New Jersey 07974 


1. Introduction 

Awk is a programming language designed 
to make many common information retrieval and 
text manipulation tasks easy to state and to per¬ 
form. 

The basic operation of awk is to scan a set 
of input lines in order, searching for lines which 
match any of a set of patterns which the user 
has specified. For each pattern, an action can be 
specified; this action will be performed on each 
line that matches the pattern. 

Readers familiar with the UNIX† program 
grep¹ will recognize the approach, although in 
awk the patterns may be more general than in 
grep, and the actions allowed are more involved 
than merely printing the matching line. For 
example, the awk program 

{print $3, $2} 

prints the third and second columns of a table in 
that order. The program 

$2 ~ /A|B|C/ 

prints all input lines with an A, B, or C in the 
second field. The program 

$1 != prev { print; prev = $1 } 

prints all lines in which the first field is different 
from the previous first field. 

1.1. Usage 

The command 

awk program [files] 

executes the awk commands in the string pro¬ 
gram on the set of named files, or on the stan¬ 
dard input if there are no files. The statements 
can also be placed in a file pfile, and executed by 
the command 

awk -f pfile [files] 


† UNIX is a trademark of Bell Laboratories. 


1.2. Program Structure 

An awk program is a sequence of state¬ 
ments of the form: 

pattern { action } 
pattern { action } 

Each line of input is matched against each of the 
patterns in turn. For each pattern that matches, 
the associated action is executed. When all the 
patterns have been tested, the next line is 
fetched and the matching starts over. 

Either the pattern or the action may be 
left out, but not both. If there is no action for a 
pattern, the matching line is simply copied to the 
output. (Thus a line which matches several pat¬ 
terns can be printed several times.) If there is no 
pattern for an action, then the action is per¬ 
formed for every input line. A line which 
matches no pattern is ignored. 

Since patterns and actions are both 
optional, actions must be enclosed in braces to 
distinguish them from patterns. 
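
The matching cycle can be sketched in Python. The `run` function and the representation of a rule as a (pattern, action) pair of callables are assumptions of this illustration, not a description of how awk is implemented.

```python
def run(rules, lines):
    """Apply each (pattern, action) rule to each line, in order.

    A missing pattern (None) matches every line; a missing action
    (None) copies the matching line to the output.
    """
    out = []
    for line in lines:
        for pattern, action in rules:
            if pattern is None or pattern(line):
                if action is None:
                    out.append(line)
                else:
                    action(line, out)
    return out

rules = [
    (lambda line: len(line) > 5, None),     # like:  length > 5
    (lambda line: "a" in line, None),       # like:  /a/
]
printed = run(rules, ["banana", "kiwi", "pear"])
print(printed)
```

Here "banana" matches both patterns and is printed twice, while "kiwi" matches neither and is ignored.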

1.3. Records and Fields 

Awk input is divided into “records” ter¬ 
minated by a record separator. The default 
record separator is a newline, so by default awk 
processes its input a line at a time. The number 
of the current record is available in a variable 
named NR. 

Each input record is considered to be 
divided into “fields.” Fields are normally 
separated by white space — blanks or tabs — 
but the input field separator may be changed, as 
described below. Fields are referred to as $1, 
$2, and so forth, where $1 is the first field, and 
$0 is the whole input record itself. Fields may 
be assigned to. The number of fields in the 
current record is available in a variable named 
NF. 

The variables FS and RS refer to the 
input field and record separators; they may be 
changed at any time to any single character. 
The optional command-line argument -Fc may 
also be used to set FS to the character c. 

If the record separator is empty, an empty 
input line is taken as the record separator, and 
blanks, tabs and newlines are treated as field 
separators. 

The variable FILENAME contains the 
name of the current input file. 
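
The division of input into records and fields can be imitated in Python. The generator below is a sketch with a hypothetical name; it assumes newline-terminated records and covers only the default white-space separator and a single-character FS, not the empty-RS case.

```python
def records(text, fs=None):
    """Yield (NR, NF, fields) for each line; fields[0] plays the role of $0."""
    for nr, record in enumerate(text.splitlines(), start=1):
        # fs=None splits on runs of blanks and tabs, like awk's default.
        parts = record.split(fs)
        yield nr, len(parts), [record] + parts

for nr, nf, fields in records("a b\tc\nd e"):
    print(nr, nf, fields[1])

passwd = list(records("root:x:0", fs=":"))   # like awk -F:
print(passwd)
```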

1.4. Printing 

An action may have no pattern, in which 
case the action is executed for all lines. The sim¬ 
plest action is to print some or all of a record; 
this is accomplished by the awk command print. 
The awk program 

{ print } 

prints each record, thus copying the input to the 
output intact. More useful is to print a field or 
fields from each record. For instance, 

print $2, $1 

prints the first two fields in reverse order. Items 
separated by a comma in the print statement will 
be separated by the current output field separa¬ 
tor when output. Items not separated by com¬ 
mas will be concatenated, so 

print $1 $2 

runs the first and second fields together. 

The predefined variables NF and NR can 
be used; for example 

{ print NR, NF, $0 } 

prints each record preceded by the record 
number and the number of fields. 

Output may be diverted to multiple files; 
the program 

{ print $1 >"foo1"; print $2 >"foo2" } 

writes the first field, $1, on the file foo1, and 
the second field on file foo2. The >> notation 
can also be used: 

print $1 >>"foo" 

appends the output to the file foo. (In each case, 
the output files are created if necessary.) The file 
name can be a variable or a field as well as a 
constant; for example, 

print $1 >$2 

uses the contents of field 2 as a file name. 

Naturally there is a limit on the number of 
output files; currently it is 10. 


Similarly, output can be piped into another 
process (on UNIX only); for instance, 

print | "mail bwk" 

mails the output to bwk. 

The variables OFS and ORS may be used 
to change the current output field separator and 
output record separator. The output record 
separator is appended to the output of the print 
statement. 

Awk also provides the printf statement 
for output formatting: 

printf format expr, expr, ... 

formats the expressions in the list according to 
the specification in format and prints them. 
For example, 

printf "%8.2f %10ld\n", $1, $2 

prints $1 as a floating point number 8 digits 
wide, with two after the decimal point, and $2 as 
a 10-digit long decimal number, followed by a 
newline. No output separators are produced 
automatically; you must add them yourself, as in 
this example. The version of printf is identical 
to that used with C.² 

2. Patterns 

A pattern in front of an action acts as a 
selector that determines whether the action is to 
be executed. A variety of expressions may be 
used as patterns: regular expressions, arithmetic 
relational expressions, string-valued expressions, 
and arbitrary boolean combinations of these. 

2.1. BEGIN and END 

The special pattern BEGIN matches the 
beginning of the input, before the first record is 
read. The pattern END matches the end of the 
input, after the last record has been processed. 
BEGIN and END thus provide a way to gain 
control before and after processing, for initializa¬ 
tion and wrapup. 

As an example, the field separator can be 
set to a colon by 

BEGIN { FS = ":" }

... rest of program ... 

Or the input lines may be counted by 

END { print NR } 

If BEGIN is present, it must be the first pat¬ 
tern; END must be the last if used. 
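
The two special patterns can be combined in one program. The sketch below assumes a present-day awk run from the shell; the function name first_fields and the sample input are illustrative only.

```shell
# Hypothetical helper: set the field separator before any input is
# read, print the first field of each record, then report the count.
first_fields() {
    awk 'BEGIN { FS = ":" } { print $1 } END { print NR, "records" }'
}

printf 'root:x:0\nbin:x:1\n' | first_fields
```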


2.2. Regular Expressions 

The simplest regular expression is a literal 
string of characters enclosed in slashes, like 

/smith/ 

This is actually a complete awk program which 
will print all lines which contain any occurrence 
of the name “smith”. If a line contains “smith” 
as part of a larger word, it will also be printed, as 
in 

blacksmithing 

Awk regular expressions include the regular 
expression forms found in the UNIX text editor 
ed[1] and grep (without back-referencing). In
addition, awk allows parentheses for grouping, |
for alternatives, + for “one or more”, and ? for
“zero or one”, all as in lex. Character classes
may be abbreviated: [a-zA-Z0-9] is the set of
all letters and digits. As an example, the awk 
program 

/[Aa]ho|[Ww]einberger|[Kk]ernighan/ 

will print all lines which contain any of the 
names “Aho,” “Weinberger” or “Kernighan,” 
whether capitalized or not. 

Regular expressions (with the extensions 
listed above) must be enclosed in slashes, just as 
in ed and sed. Within a regular expression, 
blanks and the regular expression metacharacters 
are significant. To turn off the magic meaning of
one of the regular expression characters, precede 
it with a backslash. An example is the pattern 

/\/.*\// 

which matches any string of characters enclosed 
in slashes. 

One can also specify that any field or vari¬ 
able matches a regular expression (or does not 
match it) with the operators ~ and !~. The pro¬ 
gram 

$1 ~ /[jJ]ohn/

prints all lines where the first field matches 
“john” or “John.” Notice that this will also 
match “Johnson”, “St. Johnsbury”, and so on. 
To restrict it to exactly [jJ]ohn, use

$1 ~ /^[jJ]ohn$/

The caret ^ refers to the beginning of a line or
field; the dollar sign $ refers to the end. 
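
The effect of anchoring can be checked on a few sample lines. This is a sketch assuming a present-day awk run from the shell; the function name exact_john and the input are illustrative.

```shell
# Hypothetical helper: keep only lines whose first field is exactly
# "john" or "John"; longer words containing the name are rejected.
exact_john() {
    awk '$1 ~ /^[jJ]ohn$/'
}

printf 'John 1\nJohnson 2\njohn 3\nSt. Johnsbury 4\n' | exact_john
```

Without the caret and dollar sign, the second line would also be selected, since its first field contains “John”.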

2.3. Relational Expressions 

An awk pattern can be a relational expression
involving the usual relational operators <, <=,
==, !=, >=, and >. An example is


$2 > $1 + 100 

which selects lines where the second field is
more than 100 greater than the first field. Similarly,

NF % 2 == 0

prints lines with an even number of fields. 

In relational tests, if neither operand is 
numeric, a string comparison is made; otherwise
it is numeric. Thus, 

$1 >= "s" 

selects lines that begin with an s, t, u, etc. In 
the absence of any other information, fields are 
treated as strings, so the program 

$1 > $2 

will perform a string comparison. 

2.4. Combinations of Patterns 

A pattern can be any boolean combination 
of patterns, using the operators || (or), && (and), 
and ! (not). For example, 

$1 >= "s" && $1 < "t" && $1 != "smith"

selects lines where the first field begins with “s”, 
but is not “smith”. && and || guarantee that 
their operands will be evaluated from left to 
right; evaluation stops as soon as the truth or 
falsehood is determined. 
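
A sketch of this combined pattern applied to sample input, assuming a present-day awk run from the shell (the function name s_but_not_smith is hypothetical):

```shell
# Hypothetical helper: select lines whose first field begins with "s"
# but is not "smith". The string constants force string comparison.
s_but_not_smith() {
    awk '$1 >= "s" && $1 < "t" && $1 != "smith"'
}

printf 'smith\nsmythe\ntaylor\nadams\n' | s_but_not_smith
```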

2.5. Pattern Ranges 

The “pattern” that selects an action may 
also consist of two patterns separated by a 
comma, as in 

pat1, pat2 { ... }

In this case, the action is performed for each line
between an occurrence of pat1 and the next
occurrence of pat2 (inclusive). For example, 

/start/, /stop/ 

prints all lines between start and stop, while 

NR == 100, NR == 200 { ... }

does the action for lines 100 through 200 of the 
input. 
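
The inclusive behavior of a range can be seen in a short sketch, assuming a present-day awk run from the shell (the function name between_markers is hypothetical):

```shell
# Hypothetical helper: print every line from one matching "start"
# through the next one matching "stop", both end lines included.
between_markers() {
    awk '/start/, /stop/'
}

printf 'a\nstart\nb\nstop\nc\n' | between_markers
```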

3. Actions 

An awk action is a sequence of action 
statements terminated by newlines or semi¬ 
colons. These action statements can be used to 
do a variety of bookkeeping and string
manipulating tasks.

3.1. Built-in Functions

Awk provides a “length” function to com¬ 
pute the length of a string of characters. This 
program prints each record, preceded by its 
length: 

{print length, $0}

length by itself is a “pseudo-variable” which 
yields the length of the current record; 
length(argument) is a function which yields the
length of its argument, as in the equivalent

{print length($0), $0}

The argument may be any expression. 

Awk also provides the arithmetic functions 
sqrt, log, exp, and int, for square root, base e 
logarithm, exponential, and integer part of their 
respective arguments. 

The name of one of these built-in func¬ 
tions, without argument or parentheses, stands 
for the value of the function on the whole record. 
The program 

length < 10 || length > 20 

prints lines whose length is less than 10 or 
greater than 20. 

The function substr(s, m, n) produces the 
substring of s that begins at position m (origin 
1) and is at most n characters long. If n is omit¬ 
ted, the substring goes to the end of s. The 
function index(s1, s2) returns the position
where the string s2 occurs in s1, or zero if it
does not.

The function sprintf(f, e1, e2, ...) produces
the value of the expressions e1, e2, etc., in
the printf format specified by f. Thus, for
example,

x = sprintf("%8.2f %10ld", $1, $2)

sets x to the string produced by formatting the 
values of $1 and $2. 
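
The string functions can be tried together in a BEGIN action, which needs no input. This is a sketch assuming a present-day awk run from the shell; the function name string_demo and the sample values are illustrative.

```shell
# Hypothetical helper exercising index, substr and sprintf.
string_demo() {
    awk 'BEGIN {
        s = "blacksmithing"
        p = index(s, "smith")               # position of "smith" in s
        print p, substr(s, p, 5), substr(s, p + 5)
        x = sprintf("%6.2f|%3d", 2.5, 7)    # format into a string
        print x
    }'
}

string_demo
```

The first line of output gives the position, the five-character substring starting there, and the remainder of the string; the second shows that sprintf returns the formatted text instead of printing it.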

3.2. Variables, Expressions, and Assign¬ 
ments 

Awk variables take on numeric (floating 
point) or string values according to context. For 
example, in 

x = 1 

x is clearly a number, while in

x = "smith"

it is clearly a string. Strings are converted to 
numbers and vice versa whenever context 
demands it. For instance, 

x = "3" + "4"


assigns 7 to x. Strings which cannot be inter¬ 
preted as numbers in a numerical context will 
generally have numeric value zero, but it is 
unwise to count on this behavior. 

By default, variables (other than built-ins) 
are initialized to the null string, which has 
numerical value zero; this eliminates the need for 
most BEGIN sections. For example, the sums 
of the first two fields can be computed by 

{ s1 += $1; s2 += $2 }
END { print s1, s2 }

Arithmetic is done internally in floating
point. The arithmetic operators are +, -, *, /,
and % (mod). The C increment ++ and decrement
-- operators are also available, and so are
the assignment operators +=, -=, *=, /=, and
%=. These operators may all be used in
expressions.

3.3. Field Variables 

Fields in awk share essentially all of the 
properties of variables — they may be used in 
arithmetic or string operations, and may be 
assigned to. Thus one can replace the first field 
with a sequence number like this: 

{ $1 = NR; print }

or accumulate two fields into a third, like this:

{ $1 = $2 + $3; print $0 }

or assign a string to a field:

{ if ($3 > 1000)
      $3 = "too big"
  print
}

which replaces the third field by “too big” when 
it is, and in any case prints the record. 

Field references may be numerical expres¬ 
sions, as in 

{ print $i, $(i+1), $(i+n) }

Whether a field is deemed numeric or string
depends on context; in ambiguous cases like

if ($1 == $2) ...

fields are treated as strings. 

Each input line is split into fields automat¬ 
ically as necessary. It is also possible to split any 
variable or string into fields: 

n = split(s, array, sep) 

splits the string s into array[1], ...,
array[n]. The number of elements found is 
returned. If the sep argument is provided, it is 
used as the field separator; otherwise FS is used 
as the separator.
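
As a sketch of split with an explicit separator, assuming a present-day awk run from the shell (the function name split_date and the array name part are hypothetical):

```shell
# Hypothetical helper: break each input line at slashes and print the
# number of pieces followed by the first and third pieces.
split_date() {
    awk '{ n = split($0, part, "/"); print n, part[1], part[3] }'
}

printf '10/15/1978\n' | split_date
```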

3.4. String Concatenation 

Strings may be concatenated. For example

length($1 $2 $3)

returns the length of the first three fields. Or in 
a print statement, 

print $1 " is " $2 

prints the two fields separated by “ is ”. Vari¬ 
ables and numeric expressions may also appear 
in concatenations. 

3.5. Arrays 

Array elements are not declared; they 
spring into existence by being mentioned. Sub¬ 
scripts may have any non-null value, including 
non-numeric strings. As an example of a con¬ 
ventional numeric subscript, the statement 

x[NR] = $0 

assigns the current input record to the NR-th 
element of the array x. In fact, it is possible in 
principle (though perhaps slow) to process the 
entire input in a random order with the awk pro¬ 
gram 

{ x[NR] = $0 }

END { ... program ... }

The first action merely records each input line in 
the array x. 
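
One possible END action, offered only as an illustration, prints the stored lines in reverse order. The sketch assumes a present-day awk run from the shell; the function name reverse_lines is hypothetical.

```shell
# Hypothetical helper: save every input line in the array x, then
# walk the array backwards once the input is exhausted.
reverse_lines() {
    awk '{ x[NR] = $0 } END { for (i = NR; i >= 1; i--) print x[i] }'
}

printf 'a\nb\nc\n' | reverse_lines
```

Because the whole input is held in memory, this approach suits modest files only.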

Array elements may be named by non¬ 
numeric values, which gives awk a capability 
rather like the associative memory of Snobol 
tables. Suppose the input contains fields with 
values like apple, orange, etc. Then the pro¬ 
gram 

/apple/  { x["apple"]++ }
/orange/ { x["orange"]++ }
END      { print x["apple"], x["orange"] }

increments counts for the named array elements, 
and prints them at the end of the input. 

3.6. Flow-of-Control Statements 

Awk provides the basic flow-of-control 
statements if-else, while, for, and statement 
grouping with braces, as in C. We showed the if 
statement in section 3.3 without describing it. 
The condition in parentheses is evaluated; if it is 
true, the statement following the if is done. The 
else part is optional. 

The while statement is exactly like that of 
C. For example, to print all input fields one per 
line, 


i = 1
while (i <= NF) {
    print $i
    ++i
}

The for statement is also exactly that of 
C: 

for (i = 1; i <= NF; i++) 
print $i 

does the same job as the while statement above. 

There is an alternate form of the for 
statement which is suited for accessing the ele¬ 
ments of an associative array: 

for (i in array) 
statement 

does statement with i set in turn to each element 
of array. The elements are accessed in an 
apparently random order. Chaos will ensue if i 
is altered, or if any new elements are accessed 
during the loop. 
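
A sketch of this form of the for statement, counting occurrences of each first field. It assumes a present-day awk run from the shell; the function name tally is hypothetical, and the output is piped through sort only because the traversal order is unpredictable.

```shell
# Hypothetical helper: count how often each first field occurs, then
# visit every subscript of the array n with "for (k in n)".
tally() {
    awk '{ n[$1]++ } END { for (k in n) print k, n[k] }' | sort
}

printf 'apple\norange\napple\n' | tally
```

The loop body only reads n[k] for subscripts that already exist, so it neither alters the loop variable nor creates new elements.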

The expression in the condition part of an 
if, while or for can include relational operators 
like <, <=, >, >=, == (“is equal to”), and !=
(“not equal to”); regular expression matches with 
the match operators ~ and !~; the logical opera¬ 
tors ||, &&, and !; and of course parentheses for 
grouping. 

The break statement causes an immediate 
exit from an enclosing while or for; the con¬ 
tinue statement causes the next iteration to 
begin. 

The statement next causes awk to skip 
immediately to the next record and begin scan¬ 
ning the patterns from the top. The statement 
exit causes the program to behave as if the end 
of the input had occurred. 
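
The statements next and exit can be sketched together as follows, assuming a present-day awk run from the shell (the function name first_three_data_lines and the counter n are hypothetical):

```shell
# Hypothetical helper: skip lines that begin with "#", print the
# others, and stop reading after the third printed line.
first_three_data_lines() {
    awk '/^#/ { next } { print; if (++n == 3) exit }'
}

printf '# c\na\nb\n# d\nc\nd\n' | first_three_data_lines
```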

Comments may be placed in awk pro¬ 
grams: they begin with the character # and end 
with the end of the line, as in 

print x, y # this is a comment 

4. Design 

The UNIX system already provides several 
programs that operate by passing input through 
a selection mechanism. Grep, the first and sim¬ 
plest, merely prints all lines which match a single 
specified pattern. Egrep provides more general 
patterns, i.e., regular expressions in full general¬ 
ity; fgrep searches for a set of keywords with a 
particularly fast algorithm. Sed[1] provides most
of the editing facilities of the editor ed, applied 
to a stream of input. None of these programs 
provides numeric capabilities, logical relations, or 
variables.

Lex[3] provides general regular expression
recognition capabilities, and, by serving as a C
program generator, is essentially open-ended in
its capabilities. The use of lex, however, requires
a knowledge of C programming, and a lex pro¬ 
gram must be compiled and loaded before use, 
which discourages its use for one-shot applica¬ 
tions. 

Awk is an attempt to fill in another part of 
the matrix of possibilities. It provides general 
regular expression capabilities and an implicit 
input/output loop. But it also provides con¬ 
venient numeric processing, variables, more gen¬ 
eral selection, and control flow in the actions. It 
does not require compilation or a knowledge of 
C. Finally, awk provides a convenient way to 
access fields within lines; it is unique in this 
respect. 

Awk also tries to integrate strings and 
numbers completely, by treating all quantities as 
both string and numeric, deciding which 
representation is appropriate as late as possible. 
In most cases the user can simply ignore the 
differences. 

Most of the effort in developing awk went 
into deciding what awk should or should not do 
(for instance, it doesn’t do string substitution) 
and what the syntax should be (no explicit 
operator for concatenation) rather than on writ¬ 
ing or debugging the code. We have tried to 
make the syntax powerful but easy to use and 
well adapted to scanning files. For example, the 
absence of declarations and implicit initializa¬ 
tions, while probably a bad idea for a general- 
purpose programming language, is desirable in a 
language that is meant to be used for tiny pro¬ 
grams that may even be composed on the com¬ 
mand line. 

In practice, awk usage seems to fall into 
two broad categories. One is what might be 
called “report generation” — processing an input 
to extract counts, sums, sub-totals, etc. This 
also includes the writing of trivial data validation 
programs, such as verifying that a field contains 
only numeric information or that certain delim¬ 
iters are properly balanced. The combination of 
textual and numeric processing is invaluable 
here. 

A second area of use is as a data 
transformer, converting data from the form pro¬ 
duced by one program into that expected by 
another. The simplest examples merely select 
fields, perhaps with rearrangements. 


5. Implementation 

The actual implementation of awk uses the 
language development tools available on the 
UNIX operating system. The grammar is 
specified with yacc;[4] the lexical analysis is done
by lex; the regular expression recognizers are
deterministic finite automata constructed directly 
from the expressions. An awk program is 
translated into a parse tree which is then directly 
executed by a simple interpreter. 

Awk was designed for ease of use rather 
than processing speed; the delayed evaluation of 
variable types and the necessity to break input 
into fields makes high speed difficult to achieve 
in any case. Nonetheless, the program has not 
proven to be unworkably slow. 

Table I below shows the execution (user + 
system) time on a PDP-11/70 of the UNIX programs
wc, grep, egrep, fgrep, sed, lex, and awk on
the following simple tasks: 

1. count the number of lines. 

2. print all lines containing “doug”. 

3. print all lines containing “doug”, “ken” or 
“dmr”. 

4. print the third field of each line. 

5. print the third and second fields of each 
line, in that order. 

6. append all lines containing “doug”, “ken”, 
and “dmr” to files “jdoug”, “jken”, and 
“jdmr”, respectively. 

7. print each line prefixed by “line-number : ”.

8. sum the fourth column of a table. 

The program wc merely counts words, lines and 
characters in its input; we have already men¬ 
tioned the others. In all cases the input was a 
file containing 10,000 lines as created by the 
command ls -l; each line has the form

-rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx 

The total length of this input is 452,960 charac¬ 
ters. Times for lex do not include compile or 
load. 

As might be expected, awk is not as fast as 
the specialized tools wc, sed, or the programs in
the grep family, but is faster than the more gen¬ 
eral tool lex. In all cases, the tasks were about as 
easy to express as awk programs as programs in 
these other languages; tasks involving fields were 
considerably easier to express as awk programs. 
Some of the test programs are shown in awk, sed
and lex.

                                 Task
Program       1      2      3      4      5      6      7      8

wc          8.6
grep       11.7   13.1
egrep       6.2   11.5   11.6
fgrep       7.7   13.8   16.1
sed        10.2   11.6   15.8   29.0   30.5   16.1
lex        65.1  150.1  144.2   67.7   70.3  104.0   81.7   92.8
awk        15.0   25.6   29.9   33.3   38.9   46.4   71.4   31.1

Table I. Execution Times of Programs. (Times are in sec.)


The programs for some of these jobs are 
shown below. The lex programs are generally too 
long to show. 

AWK: 

1. END {print NR}

2. /doug/

3. /ken|doug|dmr/

4. {print $3}

5. {print $3, $2}

6. /ken/ {print >"jken"}
   /doug/ {print >"jdoug"}
   /dmr/ {print >"jdmr"}

7. {print NR ": " $0}

8. {sum = sum + $4}
   END {print sum}

SED: 

1. $=

2. /doug/p

3. /doug/p
   /doug/d
   /ken/p
   /ken/d
   /dmr/p
   /dmr/d

4. /[^ ]* [ ]*[^ ]* [ ]*\([^ ]*\) .*/s//\1/p

5. /[^ ]* [ ]*\([^ ]*\) [ ]*\([^ ]*\) .*/s//\2 \1/p

6. /ken/w jken
   /doug/w jdoug
   /dmr/w jdmr


LEX:

1. %{
   int i;
   %}
   %%
   \n i++;
   %%
   yywrap() {
       printf("%d\n", i);
   }

2. %%
   ^.*doug.*$ printf("%s\n", yytext);
   \n ;

Make — A Program for Maintaining Computer Programs

S. I. Feldman 


ABSTRACT 

In a programming project, it is easy to lose track of which files need to be 
reprocessed or recompiled after a change is made in some part of the source. 
Make provides a simple mechanism for maintaining up-to-date versions of pro¬ 
grams that result from many operations on a number of files. It is possible to 
tell Make the sequence of commands that create certain files, and the list of 
files that require other files to be current before the operations can be done. 
Whenever a change is made in any part of the program, the Make command 
will create the proper files simply, correctly, and with a minimum amount of 
effort. 

The basic operation of Make is to find the name of a needed target in the 
description, ensure that all of the files on which it depends exist and are up to 
date, and then create the target if it has not been modified since its generators 
were. The description file really defines the graph of dependencies; Make does 
a depth-first search of this graph to determine what work is really necessary. 

Make also provides a simple macro substitution facility and the ability to 
encapsulate commands in a single file for convenient administration. 


August 15, 1978 




Introduction 

It is common practice to divide large programs into smaller, more manageable pieces. 
The pieces may require quite different treatments: some may need to be run through a macro 
processor, some may need to be processed by a sophisticated program generator (e.g., Yacc[1]
or Lex[2]). The outputs of these generators may then have to be compiled with special options 
and with certain definitions and declarations. The code resulting from these transformations 
may then need to be loaded together with certain libraries under the control of special options. 
Related maintenance activities involve running complicated test scripts and installing validated 
modules. Unfortunately, it is very easy for a programmer to forget which files depend on 
which others, which files have been modified recently, and the exact sequence of operations 
needed to make or exercise a new version of the program. After a long editing session, one 
may easily lose track of which files have been changed and which object modules are still valid, 
since a change to a declaration can obsolete a dozen other files. Forgetting to compile a rou¬ 
tine that has been changed or that uses changed declarations will result in a program that will 
not work, and a bug that can be very hard to track down. On the other hand, recompiling 
everything in sight just to be safe is very wasteful. 

The program described in this report mechanizes many of the activities of program 
development and maintenance. If the information on inter-file dependences and command 
sequences is stored in a file, the simple command 

make 

is frequently sufficient to update the interesting files, regardless of the number that have been 
edited since the last “make”. In most cases, the description file is easy to write and changes 
infrequently. It is usually easier to type the make command than to issue even one of the 
needed operations, so the typical cycle of program development operations becomes 

think — edit — make — test . . . 

Make is most useful for medium-sized programming projects; it does not solve the prob¬ 
lems of maintaining multiple source versions or of describing huge programs. Make was 
designed for use on Unix, but a version runs on GCOS. 

Basic Features 

The basic operation of make is to update a target file by ensuring that all of the files on 
which it depends exist and are up to date, then creating the target if it has not been modified 
since its dependents were. Make does a depth-first search of the graph of dependences. The 
operation of the command depends on the ability to find the date and time that a file was last 
modified. 
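
The central test can be sketched in the shell. This is an illustration of the rule, not make's own code: it assumes a shell whose test command supports the -nt (newer than) operator, as most current shells do, and it examines only one level of dependents instead of searching the whole graph. The function name needs_rebuild is hypothetical.

```shell
# Hypothetical helper: succeed (status 0) when the target must be
# re-created, that is, when it is missing or when any of the listed
# dependents has been modified more recently than the target.
needs_rebuild() {
    target=$1
    shift
    [ -e "$target" ] || return 0
    for dep in "$@"; do
        [ "$dep" -nt "$target" ] && return 0
    done
    return 1
}
```

A call such as needs_rebuild prog x.o y.o z.o would then report whether prog has to be loaded again.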

To illustrate, let us consider a simple example: A program named prog is made by compiling
and loading three C-language files x.c, y.c, and z.c with the lS library. By convention, the
output of the C compilations will be found in files named x.o, y.o, and z.o. Assume that the
files x.c and y.c share some declarations in a file named defs, but that z.c does not. That is, x.c
and y.c have the line 

#include "defs"

The following text describes the relationships and operations:

prog : x.o y.o z.o
	cc x.o y.o z.o -lS -o prog

x.o y.o : defs

If this information were stored in a file named makefile, the command

make

would perform the operations needed to recreate prog after any changes had been made to any 
of the four source files x.c, y.c, z.c, or defs.

Make operates using three sources of information: a user-supplied description file (as 
above), file names and “last-modified” times from the file system, and built-in rules to bridge 
some of the gaps. In our example, the first line says that prog depends on three “.o” files. 
Once these object files are current, the second line describes how to load them to create prog. 
The third line says that x.o and y.o depend on the file defs. From the file system, make discovers
that there are three “.c” files corresponding to the needed “.o” files, and uses built-in information
on how to generate an object from a source file (i.e., issue a “cc -c” command).

The following long-winded description file is equivalent to the one above, but takes no 
advantage of make's innate knowledge: 

prog : x.o y.o z.o
	cc x.o y.o z.o -lS -o prog

x.o : x.c defs
	cc -c x.c

y.o : y.c defs
	cc -c y.c

z.o : z.c
	cc -c z.c

If none of the source or object files had changed since the last time prog was made, all of 
the files would be current, and the command 

make 

would just announce this fact and stop. If, however, the defs file had been edited, x.c and y.c 
(but not z.c) would be recompiled, and then prog would be created from the new “.o” files. If 
only the file y.c had changed, only it would be recompiled, but it would still be necessary to 
reload prog. 

If no target name is given on the make command line, the first target mentioned in the 
description is created; otherwise the specified targets are made. The command 

make x.o 

would recompile x.o if x.c or defs had changed. 

If the file exists after the commands are executed, its time of last modification is used in 
further decisions; otherwise the current time is used. It is often quite useful to include rules 
with mnemonic names and commands that do not actually produce a file with that name. 
These entries can take advantage of make's ability to generate files and substitute macros. 
Thus, an entry “save” might be included to copy a certain set of files, or an entry “cleanup” 
might be used to throw away unneeded intermediate files. In other cases one may maintain a 
zero-length file purely to keep track of the time at which certain actions were performed. This 
technique is useful for maintaining remote archives and listings. 

Make has a simple macro mechanism for substituting in dependency lines and command 
strings. Macros are defined by command arguments or description file lines with embedded 
equal signs. A macro is invoked by preceding the name by a dollar sign; macro names longer 
than one character must be parenthesized. The name of the macro is either the single
character after the dollar sign or a name inside parentheses. The following are valid
macro invocations:


$(CFLAGS) 

$2 

$(xy) 

$Z 

$(Z) 

The last two invocations are identical. $$ is a dollar sign. All of these macros are assigned 
values during input, as shown below. Four special macros change values during the execution 
of the command: $*, $@, $?, and $<. They will be discussed later. The following fragment 
shows the use: 

OBJECTS = x.o y.o z.o
LIBES = -lS
prog: $(OBJECTS)
	cc $(OBJECTS) $(LIBES) -o prog

The command

make

loads the three object files with the lS library. The command

make "LIBES = -ll -lS"

loads them with both the Lex (“-ll”) and the Standard (“-lS”) libraries, since macro definitions
on the command line override definitions in the description. (It is necessary to quote arguments
with embedded blanks in UNIX† commands.)

The following sections detail the form of description files and the command line, and dis¬ 
cuss options and built-in rules in more detail. 

Description Files and Substitutions 

A description file contains three types of information: macro definitions, dependency 
information, and executable commands. There is also a comment convention: all characters 
after a sharp (#) are ignored, as is the sharp itself. Blank lines and lines beginning with a 
sharp are totally ignored. If a non-comment line is too long, it can be continued using a 
backslash. If the last character of a line is a backslash, the backslash, newline, and following 
blanks and tabs are replaced by a single blank. 
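
The comment and continuation conventions can be imitated with a short awk filter. This is only a sketch of the rules as stated above, not the way make itself reads a description file: it assumes a present-day awk with the sub function, it does not treat a sharp inside quotes specially, and the function name join_description_lines is hypothetical.

```shell
# Hypothetical filter: strip comments, join continued lines, and drop
# lines that are blank once comments have been removed.
join_description_lines() {
    awk '{
        sub(/#.*/, "")                  # the sharp and what follows are ignored
        if (joining) {                  # this line continues the previous one
            sub(/^[ \t]+/, "")          # leading blanks and tabs are dropped
            $0 = held " " $0            # and replaced by a single blank
            joining = 0
        }
        if ($0 ~ /\\$/) {               # last character is a backslash
            held = substr($0, 1, length($0) - 1)
            joining = 1
            next
        }
        if ($0 !~ /^[ \t]*$/) print     # blank lines are ignored
    }'
}

printf 'OBJECTS = x.o\\\n\ty.o\n\n# a comment\nprog: $(OBJECTS)\n' |
    join_description_lines
```

The continued definition comes out as the single line “OBJECTS = x.o y.o”, and the comment and blank lines disappear.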

A macro definition is a line containing an equal sign not preceded by a colon or a tab. 
The name (string of letters and digits) to the left of the equal sign (trailing blanks and tabs are 
stripped) is assigned the string of characters following the equal sign (leading blanks and tabs 
are stripped.) The following are valid macro definitions: 

2 = xyz
abc = -ll -ly -lS
LIBES =

The last definition assigns LIBES the null string. A macro that is never explicitly defined has 
the null string as value. Macro definitions may also appear on the make command line (see 
below). 

† UNIX is a trademark of Bell Laboratories.

Other lines give information about target files. The general form of an entry is:

target1 [target2 ...] :[:] [dependent1 ...] [; commands] [# ...]
[(tab) commands] [# ...]

Items inside brackets may be omitted. Targets and dependents are strings of letters, digits, 
periods, and slashes. (Shell metacharacters “*” and “?” are expanded.) A command is any
string of characters not including a sharp (except in quotes) or newline. Commands may 
appear either after a semicolon on a dependency line or on lines beginning with a tab immedi¬ 
ately following a dependency line. 

A dependency line may have either a single or a double colon. A target name may appear 
on more than one dependency line, but all of those lines must be of the same (single or double 
colon) type. 

1. For the usual single-colon case, at most one of these dependency lines may have a com¬ 
mand sequence associated with it. If the target is out of date with any of the dependents 
on any of the lines, and a command sequence is specified (even a null one following a 
semicolon or tab), it is executed; otherwise a default creation rule may be invoked. 

2. In the double-colon case, a command sequence may be associated with each dependency 
line; if the target is out of date with any of the files on a particular line, the associated 
commands are executed. A built-in rule may also be executed. This detailed form is of 
particular value in updating archive-type files. 

If a target must be created, the sequence of commands is executed. Normally, each com¬ 
mand line is printed and then passed to a separate invocation of the Shell after substituting for 
macros. (The printing is suppressed in silent mode or if the command line begins with an @ 
sign.) Make normally stops if any command signals an error by returning a non-zero error
code. (Errors are ignored if the “-i” flag has been specified on the make command line, if the
fake target name “.IGNORE” appears in the description file, or if the command string in the
description file begins with a hyphen. Some UNIX commands return meaningless status.)
Because each command line is passed to a separate invocation of the Shell, care must be taken 
with certain commands (e.g., cd and Shell control commands) that have meaning only within a 
single Shell process; the results are forgotten before the next line is executed. 
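
The consequence for cd can be demonstrated directly from the shell, with each sh -c call standing in for one command line of a description file. This is a sketch; the two function names are hypothetical.

```shell
# Two separate shells: the directory change made by the first is
# forgotten before the second one runs.
run_lines_separately() {
    sh -c 'cd /'
    sh -c 'pwd'
}

# One shell: both commands sit on the same line, separated by a
# semicolon, so pwd sees the new directory.
run_on_one_line() {
    sh -c 'cd /; pwd'
}

run_lines_separately
run_on_one_line
```

In a description file the same effect is obtained by writing the dependent commands on one command line separated by semicolons.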

Before issuing any command, certain macros are set. $@ is set to the name of the file to 
be “made”. $? is set to the string of names that were found to be younger than the target. If 
the command was generated by an implicit rule (see below), $< is the name of the related file 
that caused the action, and $* is the prefix shared by the current and the dependent file names. 

If a file must be made but there are no explicit commands or relevant built-in rules, the 
commands associated with the name “.DEFAULT” are used. If there is no such name, make 
prints a message and stops. 

Command Usage 

The make command takes four kinds of arguments: macro definitions, flags, description 
file names, and target file names. 

make [ flags ] [ macro definitions ] [ targets ] 

The following summary of the operation of the command explains how these arguments are 
interpreted. 

First, all macro definition arguments (arguments with embedded equal signs) are analyzed 
and the assignments made. Command-line macros override corresponding definitions found in 
the description files. 

Next, the flag arguments are examined. The permissible flags are

-i Ignore error codes returned by invoked commands. This mode is also entered if the fake
target name “.IGNORE” appears in the description file.


-s Silent mode. Do not print command lines before executing. This mode is also entered if 
the fake target name “.SILENT” appears in the description file. 

-r Do not use the built-in rules. 

-n No execute mode. Print commands, but do not execute them. Even lines beginning with 
an “@” sign are printed. 

-t Touch the target files (causing them to be up to date) rather than issue the usual com¬ 
mands. 

-q Question. The make command returns a zero or non-zero status code depending on 
whether the target file is or is not up to date. 

-p Print out the complete set of macro definitions and target descriptions.

-d Debug mode. Print out detailed information on files and times examined.

-f Description file name. The next argument is assumed to be the name of a description 
file. A file name of “-” denotes the standard input. If there are no “-f” arguments, the
file named makefile or Makefile in the current directory is read. The contents of the
description files override the built-in rules if they are present.

Finally, the remaining arguments are assumed to be the names of targets to be made; 
they are done in left to right order. If there are no such arguments, the first name in the 
description files that does not begin with a period is “made”. 

Implicit Rules 

The make program uses a table of interesting suffixes and a set of transformation rules to
supply default dependency information and implied commands. (The Appendix describes these
tables and means of overriding them.) The default suffix list is:

.o      Object file
.c      C source file
.e      Efl source file
.r      Ratfor source file
.f      Fortran source file
.s      Assembler source file
.y      Yacc-C source grammar
.yr     Yacc-Ratfor source grammar
.ye     Yacc-Efl source grammar
.l      Lex source grammar

The following diagram summarizes the default transformation paths. If there are two paths
connecting a pair of suffixes, the longer one is used only if the intermediate file exists or is
named in the description.

                            .o

    .c    .r    .e    .f    .s    .y    .yr    .ye    .l    .d

  .y .l  .yr   .ye

If the file x.o were needed and there were an x.c in the description or directory, it would
be compiled. If there were also an x.l, that grammar would be run through Lex before
compiling the result. However, if there were no x.c but there were an x.l, make would discard
the intermediate C-language file and use the direct link in the graph above.







It is possible to change the names of some of the compilers used in the default, or the flag 
arguments with which they are invoked by knowing the macro names used. The compiler 
names are the macros AS, CC, RC, EC, YACC, YACCR, YACCE, and LEX. The command 

make CC=newcc

will cause the “newcc” command to be used instead of the usual C compiler. The macros 
CFLAGS, RFLAGS, EFLAGS, YFLAGS, and LFLAGS may be set to cause these commands 
to be issued with optional flags. Thus, 

make "CFLAGS = -O" 

causes the optimizing C compiler to be used. 

Example 

As an example of the use of make, we will present the description file used to maintain 
the make command itself. The code for make is spread over a number of C source files and a 
Yacc grammar. The description file contains: 

# Description file for the Make command
P = und -3 | opr -r2 # send to GCOS to be printed
FILES = Makefile version.c defs main.c doname.c misc.c files.c dosys.c gram.y lex.c gcos.c
OBJECTS = version.o main.o doname.o misc.o files.o dosys.o gram.o
LIBES = -lS
LINT = lint -p
CFLAGS = -O

make: $(OBJECTS)
	cc $(CFLAGS) $(OBJECTS) $(LIBES) -o make
	size make

$(OBJECTS): defs
gram.o: lex.c

cleanup:
	-rm *.o gram.c
	-du

install:
	@size make /usr/bin/make
	cp make /usr/bin/make ; rm make

print: $(FILES) # print recently changed files
	pr $? | $P
	touch print

test:
	make -dp | grep -v TIME >1zap
	/usr/bin/make -dp | grep -v TIME >2zap
	diff 1zap 2zap
	rm 1zap 2zap

lint : dosys.c doname.c files.c main.c misc.c version.c gram.c
	$(LINT) dosys.c doname.c files.c main.c misc.c version.c gram.c
	rm gram.c

arch:
	ar uv /sys/source/s2/make.a $(FILES)

Make usually prints out each command before issuing it. The following output results from 
typing the simple command

make


in a directory containing only the source and description file: 

cc -c version.c 
cc -c main.c 
cc -c doname.c 
cc -c misc.c 
cc -c files.c 
cc -c dosys.c 
yacc gram.y 
mv y.tab.c gram.c 
cc -c gram.c 

cc version.o main.o doname.o misc.o files.o dosys.o gram.o -lS -o make
13188+3348+3044 = 19580b = 046174b 

Although none of the source files or grammars were mentioned by name in the description file,
make found them using its suffix rules and issued the needed commands. The string of digits
results from the “size make” command. The @ sign on the size command in the description file
suppressed the printing of the command line itself, so only the sizes are written.

The last few entries in the description file are useful maintenance sequences. The “print” 
entry prints only the files that have been changed since the last “make print” command. A 
zero-length file print is maintained to keep track of the time of the printing; the $? macro in 
the command line then picks up only the names of the files changed since print was touched. 
The printed output can be sent to a different printer or to a file by changing the definition of 
the P macro: 

make print "P = opr -sp" 
or 

make print "P = cat >zap"

Suggestions and Warnings 

The most common difficulties arise from make's specific meaning of dependency. If file 
x.c has a “#include "defs"” line, then the object file x.o depends on defs; the source file x.c does
not. (If defs is changed, it is not necessary to do anything to the file x.c , while it is necessary 
to recreate x.o.) 

To discover what make would do, the “-n” option is very useful. The command 
make -n 

orders make to print out the commands it would issue without actually taking the time to
execute them. If a change to a file is absolutely certain to be benign (e.g., adding a new definition
to an include file), the “-t” (touch) option can save a lot of time: instead of issuing a large
number of superfluous recompilations, make updates the modification times on the affected files.
Thus, the command 

make -ts 

(“touch silently”) causes the relevant files to appear up to date. Obvious care is necessary, 
since this mode of operation subverts the intention of make and destroys all memory of the 
previous relationships. 

The debugging flag (“-d”) causes make to print out a very detailed description of what it 
is doing, including the file times. The output is verbose, and recommended only as a last 
resort.

Appendix. Suffixes and Transformation Rules

The make program itself does not know what file name suffixes are interesting or how to 
transform a file with one suffix into a file with another suffix. This information is stored in an 
internal table that has the form of a description file. If the “-r” flag is used, this table is not 
used. 

The list of suffixes is actually the dependency list for the name “.SUFFIXES”; make 
looks for a file with any of the suffixes on the list. If such a file exists, and if there is a 
transformation rule for that combination, make acts as described earlier. The transformation 
rule names are the concatenation of the two suffixes. The name of the rule to transform a “.r”
file to a “.o” file is thus “.r.o”. If the rule is present and no explicit command sequence has 
been given in the user’s description files, the command sequence for the rule “.r.o” is used. If a 
command is generated by using one of these suffixing rules, the macro $* is given the value of 
the stem (everything but the suffix) of the name of the file to be made, and the macro $< is 
the name of the dependent that caused the action. 
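
The naming convention just described can be sketched in C. This is an illustration only, not the source of make: the function names rule_name and stem_of are hypothetical, and the sketch covers only the string handling (joining two suffixes into a rule name, and stripping a suffix to obtain the stem that $* would hold).

```c
#include <stddef.h>
#include <string.h>

/* Join two suffixes into a rule name: ".r" and ".o" give ".r.o".
   Returns 0 on success, -1 if out is too small. */
int rule_name(const char *from, const char *to, char *out, size_t outsize)
{
    size_t need = strlen(from) + strlen(to) + 1;

    if (need > outsize)
        return -1;
    strcpy(out, from);
    strcat(out, to);
    return 0;
}

/* Copy the stem of target (everything before suffix) into out.
   Returns 0 on success, -1 if target does not end in suffix
   or if out is too small. */
int stem_of(const char *target, const char *suffix, char *out, size_t outsize)
{
    size_t tlen = strlen(target);
    size_t slen = strlen(suffix);

    if (slen > tlen || strcmp(target + tlen - slen, suffix) != 0)
        return -1;
    if (tlen - slen + 1 > outsize)
        return -1;
    memcpy(out, target, tlen - slen);
    out[tlen - slen] = '\0';
    return 0;
}
```

With these helpers, the target x.o and the suffixes “.r” and “.o” yield the rule name “.r.o” and the stem x.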

The order of the suffix list is significant, since it is scanned from left to right, and the 
first name that is formed that has both a file and a rule associated with it is used. If new 
names are to be appended, the user can just add an entry for “.SUFFIXES” in his own
description file; the dependents will be added to the usual list. A “.SUFFIXES” line without any
dependents deletes the current list. (It is necessary to clear the current list if the order of
names is to be changed.)

The following is an excerpt from the default rules file: 

.SUFFIXES : .o .c .e .r .f .y .yr .ye .l .s
YACC=yacc
YACCR = yacc -r
YACCE = yacc -e
YFLAGS =
LEX=lex
LFLAGS =
CC=cc
AS = as -
CFLAGS =
RC = ec
RFLAGS =
EC = ec
EFLAGS =
FFLAGS =

.c.o :
	$(CC) $(CFLAGS) -c $<

.e.o .r.o .f.o :
	$(EC) $(RFLAGS) $(EFLAGS) $(FFLAGS) -c $<

.s.o :
	$(AS) -o $@ $<

.y.o :
	$(YACC) $(YFLAGS) $<
	$(CC) $(CFLAGS) -c y.tab.c
	rm y.tab.c
	mv y.tab.o $@

.y.c :
	$(YACC) $(YFLAGS) $<
	mv y.tab.c $@

Lint, a C Program Checker


S. C. Johnson 


ABSTRACT 

Lint is a command which examines C source programs, detecting a 
number of bugs and obscurities. It enforces the type rules of C more strictly 
than the C compilers. It may also be used to enforce a number of portability 
restrictions involved in moving programs between different machines and/or 
operating systems. Another option detects a number of wasteful, or error 
prone, constructions which nevertheless are, strictly speaking, legal. 

Lint accepts multiple input files and library specifications, and checks 
them for consistency. 

The separation of function between lint and the C compilers has both 
historical and practical rationale. The compilers turn C programs into executable
files rapidly and efficiently. This is possible in part because the compilers
do not do sophisticated type checking, especially between separately compiled 
programs. Lint takes a more global, leisurely view of the program, looking 
much more carefully at the compatibilities. 

This document discusses the use of lint, gives an overview of the implementation,
and gives some hints on the writing of machine independent C
code. 


July 26, 1978


Introduction and Usage 

Suppose there are two C [1] source files, file1.c and file2.c, which are ordinarily compiled
and loaded together. Then the command

lint file1.c file2.c

produces messages describing inconsistencies and inefficiencies in the programs. The program
enforces the typing rules of C more strictly than the C compilers (for both historical and
practical reasons) enforce them. The command

lint -p file1.c file2.c

will produce, in addition to the above messages, additional messages which relate to the
portability of the programs to other operating systems and machines. Replacing the -p by -h will
produce messages about various error-prone or wasteful constructions which, strictly speaking, 
are not bugs. Saying -hp gets the whole works. 

The next several sections describe the major messages; the document closes with sections 
discussing the implementation and giving suggestions for writing portable C. An appendix 
gives a summary of the lint options. 

A Word About Philosophy 

Many of the facts which lint needs may be impossible to discover. For example, whether 
a given function in a program ever gets called may depend on the input data. Deciding 
whether exit is ever called is equivalent to solving the famous “halting problem,” known to be 
recursively undecidable. 

Thus, most of the lint algorithms are a compromise. If a function is never mentioned, it 
can never be called. If a function is mentioned, lint assumes it can be called; this is not
necessarily so, but in practice is quite reasonable.

Lint tries to give information with a high degree of relevance. Messages of the form “xxx 
might be a bug” are easy to generate, but are acceptable only in proportion to the fraction of 
real bugs they uncover. If this fraction of real bugs is too small, the messages lose their
credibility and serve merely to clutter up the output, obscuring the more important messages.

Keeping these issues in mind, we now consider in more detail the classes of messages 
which lint produces. 

Unused Variables and Functions 

As sets of programs evolve and develop, previously used variables and arguments to
functions may become unused; it is not uncommon for external variables, or even entire functions,
to become unnecessary, and yet not be removed from the source. These “errors of
commission” rarely cause working programs to fail, but they are a source of inefficiency, and make
programs harder to understand and change. Moreover, information about such unused
variables and functions can occasionally serve to discover bugs; if a function does a necessary job,
and is never called, something is wrong! 

Lint complains about variables and functions which are defined but not otherwise
mentioned. An exception is variables which are declared through explicit extern statements but
are never referenced; thus the statement

extern float sin();

will evoke no comment if sin is never used. Note that this agrees with the semantics of the C 
compiler. In some cases, these unused external declarations might be of some interest; they 
can be discovered by adding the -x flag to the lint invocation.

Certain styles of programming require many functions to be written with similar
interfaces; frequently, some of the arguments may be unused in many of the calls. The -v option is
available to suppress the printing of complaints about unused arguments. When -v is in 
effect, no messages are produced about unused arguments except for those arguments which 
are unused and also declared as register arguments; this can be considered an active (and 
preventable) waste of the register resources of the machine. 

There is one case where information about unused, or undefined, variables is more
distracting than helpful. This is when lint is applied to some, but not all, files out of a collection
which are to be loaded together. In this case, many of the functions and variables defined may 
not be used, and, conversely, many functions and variables defined elsewhere may be used. 
The -u flag may be used to suppress the spurious messages which might otherwise appear. 

Set/Used Information 

Lint attempts to detect cases where a variable is used before it is set. This is very 
difficult to do well; many algorithms take a good deal of time and space, and still produce
messages about perfectly valid programs. Lint detects local variables (automatic and register
storage classes) whose first use appears physically earlier in the input file than the first
assignment to the variable. It assumes that taking the address of a variable constitutes a use, since
the actual use may occur at any later time, in a data dependent fashion. 

The restriction to the physical appearance of variables in the file makes the algorithm 
very simple and quick to implement, since the true flow of control need not be discovered. It 
does mean that lint can complain about some programs which are legal, but these programs 
would probably be considered bad on stylistic grounds (e.g. might contain at least two goto’s). 
Because static and external variables are initialized to 0, no meaningful information can be 
discovered about their uses. The algorithm deals correctly, however, with initialized automatic 
variables, and variables which are used in the expression which first sets them. 
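
As a hedged illustration of the "legal but complained about" case, the following function (the name twice_plus_one is made up for this sketch) always assigns x before reading it at run time, yet the first use of x appears physically earlier in the file than the first assignment. A checker working from textual order alone, as described above, would be expected to flag it.

```c
/* Legal C: at run time x is always set before it is read, but the
   first textual use of x comes before its first textual assignment. */
int twice_plus_one(int n)
{
    int x;

    goto set;
use:
    return x + 1;       /* first use of x in textual order */
set:
    x = 2 * n;          /* first assignment to x in textual order */
    goto use;
}
```

Note that the function needs two gotos to get into this state, which matches the stylistic observation in the text.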

The set/used information also permits recognition of those local variables which are set 
and never used; these form a frequent source of inefficiencies, and may also be symptomatic of 
bugs. 

Flow of Control 

Lint attempts to detect unreachable portions of the programs which it processes. It will 
complain about unlabeled statements immediately following goto, break, continue, or return 
statements. An attempt is made to detect loops which can never be left at the bottom,
detecting the special cases while( 1 ) and for(;;) as infinite loops. Lint also complains about loops
which cannot be entered at the top; some valid programs may have such loops, but at best they 
are bad style, at worst bugs. 

Lint has an important area of blindness in the flow of control algorithm: it has no way of 
detecting functions which are called and never return. Thus, a call to exit may cause
unreachable code which lint does not detect; the most serious effects of this are in the determination of
returned function values (see the next section). 

One form of unreachable statement is not usually complained about by lint; a break
statement that cannot be reached causes no message. Programs generated by yacc [2], and
especially lex [3], may have literally hundreds of unreachable break statements. The -O flag in the
C compiler will often eliminate the resulting object code inefficiency. Thus, these unreached
statements are of little importance, there is typically nothing the user can do about them, and
the resulting messages would clutter up the lint output. If these messages are desired, lint can
be invoked with the -b option.

Function Values 

Sometimes functions return values which are never used; sometimes programs incorrectly 
use function “values” which have never been returned. Lint addresses this problem in a 
number of ways. 

Locally, within a function definition, the appearance of both 
return( expr ); 

and 


return ; 

statements is cause for alarm; lint will give the message 
function name contains return(e) and return 

The most serious difficulty with this is detecting when a function return is implied by flow of 
control reaching the end of the function. This can be seen with a simple example: 

f ( a ) {
	if ( a ) return ( 3 );
	g ();
	}



Notice that, if a tests false, f will call g and then return with no defined return value; this will 
trigger a complaint from lint. If g , like exit, never returns, the message will still be produced 
when in fact nothing is wrong. 
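
One way a programmer might avoid the mixed-return situation is sketched below, under the assumption that the second function really never returns. The helper fatal is hypothetical and stands in for g; the final return statement is never executed, but it gives every textual path through f a value. This is only one possible coding style, not a remedy prescribed by lint.

```c
#include <stdlib.h>

/* Hypothetical stand-in for g: it terminates the program and so
   never returns to its caller. */
static void fatal(void)
{
    exit(1);
}

int f(int a)
{
    if (a)
        return 3;
    fatal();
    return 0;   /* never reached at run time; present so that no
                   path falls off the end without a value */
}
```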

In practice, some potentially serious bugs have been discovered by this feature; it also 
accounts for a substantial fraction of the “noise” messages produced by lint. 

On a global scale, lint detects cases where a function returns a value, but this value is 
sometimes, or always, unused. When the value is always unused, it may constitute an 
inefficiency in the function definition. When the value is sometimes unused, it may represent 
bad style (e.g., not testing for error conditions). 

The dual problem, using a function value when the function does not return one, is also 
detected. This is a serious problem. Amazingly, this bug has been observed on a couple of
occasions in “working” programs; the desired function value just happened to have been
computed in the function return register!

Type Checking 

Lint enforces the type checking rules of C more strictly than the compilers do. The
additional checking is in four major areas: across certain binary operators and implied assignments,
at the structure selection operators, between the definition and uses of functions, and in the 
use of enumerations. 

There are a number of operators which have an implied balancing between types of the 
operands. The assignment, conditional ( ?: ), and relational operators have this property; the 
argument of a return statement, and expressions used in initialization also suffer similar 
conversions. In these operations, char, short, int, long, unsigned, float, and double types 
may be freely intermixed. The types of pointers must agree exactly, except that arrays of x’s
can, of course, be intermixed with pointers to x’s.

The type checking rules also require that, in structure references, the left operand of the
-> be a pointer to structure, the left operand of the . be a structure, and the right operand of
these operators be a member of the structure implied by the left operand. Similar checking is
done for references to unions.

Strict rules apply to function argument and return value matching. The types float and
double may be freely matched, as may the types char, short, int, and unsigned. Also, 
pointers can be matched with the associated arrays. Aside from this, all actual arguments 
must agree in type with their declared counterparts. 

With enumerations, checks are made that enumeration variables or members are not 
mixed with other types, or other enumerations, and that the only operations applied are =, 
initialization, ==, !=, and function arguments and return values.

Type Casts 

The type cast feature in C was introduced largely as an aid to producing more portable 
programs. Consider the assignment 

p = 1 ; 

where p is a character pointer. Lint will quite rightly complain. Now, consider the assignment 
p = (char *)1 ;

in which a cast has been used to convert the integer to a character pointer. The programmer 
obviously had a strong motivation for doing this, and has clearly signaled his intentions. It 
seems harsh for lint to continue to complain about this. On the other hand, if this code is 
moved to another machine, such code should be looked at carefully. The -c flag controls the
printing of comments about casts. When -c is in effect, casts are treated as though they were 
assignments subject to complaint; otherwise, all legal casts are passed without comment, no 
matter how strange the type mixing seems to be. 

Nonportable Character Use 

On the PDP-11, characters are signed quantities, with a range from -128 to 127. On 
most of the other C implementations, characters take on only positive values. Thus, lint will 
flag certain comparisons and assignments as being illegal or nonportable. For example, the 
fragment 

char c; 

if( (c = getchar()) < 0 ) ....

works on the PDP-11, but will fail on machines where characters always take on positive 
values. The real solution is to declare c an integer, since getchar is actually returning integer 
values. In any case, lint will say “nonportable character comparison”. 
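
A minimal sketch of the recommended fix follows. The function name count_chars is invented for the example, and it reads from an arbitrary stream with getc rather than from the standard input so that it is easy to try out. The point is only that c is declared int, so the end-of-file value stays distinct from every character value, including characters with the high bit set.

```c
#include <stdio.h>

/* Count the characters remaining on fp. Because c is an int, the
   value EOF cannot be confused with any character read. */
long count_chars(FILE *fp)
{
    int c;
    long n = 0;

    while ((c = getc(fp)) != EOF)
        n++;
    return n;
}
```

The comparison against EOF is used here in place of the "< 0" test in the fragment above; with c declared as an int either form behaves the same on machines where EOF is negative.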

A similar issue arises with bitfields; when assignments of constant values are made to 
bitfields, the field may be too small to hold the value. This is especially true because on some 
machines bitfields are considered as signed quantities. While it may seem unintuitive to
consider that a two bit field declared of type int cannot hold the value 3, the problem disappears if
the bitfield is declared to have type unsigned. 
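
The unsigned declaration can be sketched as follows. The structure and function names are made up for the illustration. A two-bit unsigned field holds the values 0 through 3, and larger values are reduced modulo 4 when stored.

```c
struct flags {
    unsigned int mode : 2;   /* two bits, unsigned: holds 0 through 3 */
};

/* Store v in the two-bit field and return what the field then holds. */
unsigned int store_mode(unsigned int v)
{
    struct flags f;

    f.mode = v;
    return f.mode;
}
```

Here store_mode(3) gives back 3, which is the case the text says may fail when the field is declared int on a machine that treats bitfields as signed.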

Assignments of longs to ints 

Bugs may arise from the assignment of a long to an int, which loses accuracy. This may
happen in programs which have been incompletely converted to use typedefs. When a 
typedef variable is changed from int to long, the program can stop working because some 
intermediate results may be assigned to ints, losing accuracy. Since there are a number of
legitimate reasons for assigning longs to ints, the detection of these assignments is enabled by
the -a flag.

Strange Constructions

Several perfectly legal, but somewhat strange, constructions are flagged by lint; the
messages hopefully encourage better code quality, clearer style, and may even point out bugs. The
-h flag is used to enable these checks. For example, in the statement 

*p++ ; 

the * does nothing; this provokes the message “null effect” from lint. The program fragment 

unsigned x ; 
if( x < 0 ) ... 

is clearly somewhat strange; the test will never succeed. Similarly, the test 
if( x > 0 ) ... 
is equivalent to 

if( x != 0) 

which may not be the intended action. Lint will say “degenerate unsigned comparison” in 
these cases. If one says 

if( 1 != 0 ) .... 

lint will report “constant in conditional context”, since the comparison of 1 with 0 gives a
constant result.

Another construction detected by lint involves operator precedence. Bugs which arise 
from misunderstandings about the precedence of operators can be accentuated by spacing and 
formatting, making such bugs extremely hard to find. For example, the statements 

if( x&077 == 0 ) ...

or

x<<2 + 40

probably do not do what was intended. The best solution is to parenthesize such expressions, 
and lint encourages this by an appropriate message. 
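
The effect of the precedence rules in these two statements can be seen in a short sketch. The function names are invented for the example. In C, == binds more tightly than &, and + binds more tightly than <<, so the unparenthesized forms group differently from what the spacing suggests.

```c
/* As written in the text: parsed as x & (077 == 0), that is x & 0,
   so the result is 0 for every x. */
int low_six_clear_unparenthesized(int x)
{
    return x & 077 == 0;
}

/* Parenthesized to test whether the low six bits of x are all zero. */
int low_six_clear(int x)
{
    return (x & 077) == 0;
}

/* x<<2 + 40 would be parsed as x << (2 + 40); the parentheses here
   give the shift-then-add reading. */
int shift_then_add(int x)
{
    return (x << 2) + 40;
}
```

The unparenthesized shift is deliberately not executed in this sketch, since shifting an int by 42 places is not meaningful on machines with narrower words.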

Finally, when the -h flag is in force lint complains about variables which are redeclared 
in inner blocks in a way that conflicts with their use in outer blocks. This is legal, but is
considered by many (including the author) to be bad style, usually unnecessary, and frequently a
bug. 

Ancient History 

There are several forms of older syntax which are being officially discouraged. These fall 
into two classes, assignment operators and initialization. 

The older forms of assignment operators (e.g., =+, =-,...) could cause ambiguous 
expressions, such as 

a =-1 ;

which could be taken as either 
a =- 1 ; 
or 

a = -1; 

The situation is especially perplexing if this kind of ambiguity arises as the result of a macro
substitution. The newer, and preferred operators (+=, -=, etc.) have no such ambiguities.
To spur the abandonment of the older forms, lint complains about these old fashioned
operators.

A similar issue arises with initialization. The older language allowed 
int x 1 ; 

to initialize x to 1. This also caused syntactic difficulties: for example, 
int x ( -1 ) ; 

looks somewhat like the beginning of a function declaration: 
int x ( y ) { ... 

and the compiler must read a fair ways past x in order to be sure what the declaration really is.
Again, the problem is even more perplexing when the initializer involves a macro. The current 
syntax places an equals sign between the variable and the initializer: 

int x = -1 ; 

This is free of any possible syntactic ambiguity. 

Pointer Alignment 

Certain pointer assignments may be reasonable on some machines, and illegal on others, 
due entirely to alignment restrictions. For example, on the PDP-11, it is reasonable to assign 
integer pointers to double pointers, since double precision values may begin on any integer 
boundary. On the Honeywell 6000, double precision values must begin on even word
boundaries; thus, not all such assignments make sense. Lint tries to detect cases where pointers are
assigned to other pointers, and such alignment problems might arise. The message “possible 
pointer alignment problem” results from this situation whenever either the -p or -h flags are 
in effect. 

Multiple Uses and Side Effects 

In complicated expressions, the best order in which to evaluate subexpressions may be 
highly machine dependent. For example, on machines (like the PDP-11) in which the stack 
runs backwards, function arguments will probably be best evaluated from right-to-left; on 
machines with a stack running forward, left-to-right seems most attractive. Function calls 
embedded as arguments of other functions may or may not be treated similarly to ordinary 
arguments. Similar issues arise with other operators which have side effects, such as the 
assignment operators and the increment and decrement operators. 

In order that the efficiency of C on a particular machine not be unduly compromised, the 
C language leaves the order of evaluation of complicated expressions up to the local compiler, 
and, in fact, the various C compilers have considerable differences in the order in which they 
will evaluate complicated expressions. In particular, if any variable is changed by a side effect, 
and also used elsewhere in the same expression, the result is explicitly undefined. 

Lint checks for the important special case where a simple scalar variable is affected. For 
example, the statement 

a[i] = b[i++] ;

will draw the complaint: 

warning: i evaluation order undefined 
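
A hedged sketch of a rewrite that avoids the complaint: the use of i and the increment of i are placed in separate statements, so the order of evaluation no longer matters. The function name copy_ints is made up for the illustration.

```c
/* Copy n elements from b to a. The subscript i is used in one
   statement and incremented in the next, so no expression both
   changes i and uses it elsewhere. */
void copy_ints(int *a, const int *b, int n)
{
    int i = 0;

    while (i < n) {
        a[i] = b[i];
        i++;
    }
}
```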


Implementation 

Lint consists of two programs and a driver. The first program is a version of the
Portable C Compiler [4,5] which is the basis of the IBM 370, Honeywell 6000, and Interdata 8/32 C
compilers. This compiler does lexical and syntax analysis on the input text, constructs and
maintains symbol tables, and builds trees for expressions. Instead of writing an intermediate
file which is passed to a code generator, as the other compilers do, lint produces an
intermediate file which consists of lines of ascii text. Each line contains an external variable name, an
encoding of the context in which it was seen (use, definition, declaration, etc.), a type specifier, 
and a source file name and line number. The information about variables local to a function or 
file is collected by accessing the symbol table, and examining the expression trees. 

Comments about local problems are produced as detected. The information about
external names is collected onto an intermediate file. After all the source files and library
descriptions have been collected, the intermediate file is sorted to bring all information collected about
a given external name together. The second, rather small, program then reads the lines from 
the intermediate file and compares all of the definitions, declarations, and uses for consistency. 

The driver controls this process, and is also responsible for making the options available 
to both passes of lint. 

Portability 

C on the Honeywell and IBM systems is used, in part, to write system code for the host 
operating system. This means that the implementation of C tends to follow local conventions 
rather than adhere strictly to UNIX† system conventions. Despite these differences, many C
programs have been successfully moved to GCOS and the various IBM installations with little 
effort. This section describes some of the differences between the implementations, and 
discusses the lint features which encourage portability. 

Uninitialized external variables are treated differently in different implementations of C. 
Suppose two files both contain a declaration without initialization, such as 

int a ; 

outside of any function. The UNIX loader will resolve these declarations, and cause only a
single word of storage to be set aside for a. Under the GCOS and IBM implementations, this is
not feasible (for various stupid reasons!) so each such declaration causes a word of storage to 
be set aside and called a. When loading or library editing takes place, this causes fatal conflicts 
which prevent the proper operation of the program. If lint is invoked with the -p flag, it will 
detect such multiple definitions. 

A related difficulty comes from the amount of information retained about external names 
during the loading process. On the UNIX system, externally known names have seven 
significant characters, with the upper/lower case distinction kept. On the IBM systems, there 
are eight significant characters, but the case distinction is lost. On GCOS, there are only six 
characters, of a single case. This leads to situations where programs run on the UNIX system, 
but encounter loader problems on the IBM or GCOS systems. Lint -p causes all external
symbols to be mapped to one case and truncated to six characters, providing a worst-case analysis.
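
The worst-case reduction can be approximated with a few lines of C. This is an assumption-laden sketch, not the code lint uses: the names worst_case_name and names_collide are hypothetical, and lower case is chosen arbitrarily as the "one case" since the text does not say which case is used.

```c
#include <ctype.h>
#include <string.h>

#define WORST_CASE_LEN 6

/* Reduce name to at most six characters of a single case (lower case
   is an assumption of this sketch). out must hold seven characters. */
void worst_case_name(const char *name, char out[WORST_CASE_LEN + 1])
{
    int i;

    for (i = 0; i < WORST_CASE_LEN && name[i] != '\0'; i++)
        out[i] = (char)tolower((unsigned char)name[i]);
    out[i] = '\0';
}

/* Nonzero if two external names become indistinguishable after the
   reduction, as they might for a six-character, one-case loader. */
int names_collide(const char *a, const char *b)
{
    char ra[WORST_CASE_LEN + 1];
    char rb[WORST_CASE_LEN + 1];

    worst_case_name(a, ra);
    worst_case_name(b, rb);
    return strcmp(ra, rb) == 0;
}
```

Under these assumptions, the names buffer_in and Buffer_out are distinct on the UNIX system but both reduce to buffer.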

A number of differences arise in the area of character handling: characters in the UNIX 
system are eight bit ascii, while they are eight bit ebcdic on the IBM, and nine bit ascii on 
GCOS. Moreover, character strings go from high to low bit positions (“left to right”) on GCOS 
and IBM, and low to high (“right to left”) on the PDP-11. This means that code attempting 
to construct strings out of character constants, or attempting to use characters as indices into 
arrays, must be looked at with great suspicion. Lint is of little help here, except to flag
multi-character character constants.

† UNIX is a trademark of Bell Laboratories.

Of course, the word sizes are different! This causes less trouble than might be expected,
at least when moving from the UNIX system (16 bit words) to the IBM (32 bits) or GCOS (36
bits). The main problems are likely to arise in shifting or masking. C now supports a bit-field
facility, which can be used to write much of this code in a reasonably portable way.
Frequently, portability of such code can be enhanced by slight rearrangements in coding style.
Many of the incompatibilities seem to have the flavor of writing

x &= 0177700 ;

to clear the low order six bits of x. This suffices on the PDP-11, but fails badly on GCOS and
IBM. If the bit field feature cannot be used, the same effect can be obtained by writing

x &= ~077 ;

which will work on all these machines.
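
As a small sketch of the portable form, the mask can be built by complementing the bits to be cleared, so that it widens automatically with the word size. The function name clear_low_six is invented for this example.

```c
/* Clear the low-order six bits of x. The mask is the complement of
   077, so it covers every higher bit whatever the word size. */
unsigned int clear_low_six(unsigned int x)
{
    return x & ~077u;
}
```

By contrast, the constant 0177700 only preserves bits 6 through 15, so on a machine with words wider than 16 bits it would also clear everything above bit 15.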

The right shift operator is arithmetic shift on the PDP-11, and logical shift on most 
other machines. To obtain a logical shift on all machines, the left operand can be typed 
unsigned. Characters are considered signed integers on the PDP-11, and unsigned on the 
other machines. This persistence of the sign bit may be reasonably considered a bug in the 
PDP-11 hardware which has infiltrated itself into the C language. If there were a good way to 
discover the programs which would be affected, C could be changed; in any case, lint is no help 
here. 
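
Both points can be handled by explicit conversions. The sketch below is illustrative only; the function names are hypothetical, and the shift count is assumed to be smaller than the width of an unsigned int.

```c
/* Logical right shift: typing the left operand unsigned makes the
 * vacated high-order bits zero on every machine. */
unsigned int logical_shift_right(int value, unsigned int count)
{
    return (unsigned int)value >> count;
}

/* Fetch a character as a non-negative array index, whether plain char
 * is signed (as on the PDP-11) or unsigned on the machine at hand. */
int char_as_index(char c)
{
    return (unsigned char)c;
}
```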

The above discussion may have made the problem of portability seem bigger than it in 
fact is. The issues involved here are rarely subtle or mysterious, at least to the implementor of 
the program, although they can involve some work to straighten out. The most serious bar to 
the portability of UNIX system utilities has been the inability to mimic essential UNIX system 
functions on the other systems. The inability to seek to a random character position in a text 
file, or to establish a pipe between processes, has involved far more rewriting and debugging 
than any of the differences in C compilers. On the other hand, lint has been very helpful in 
moving the UNIX operating system and associated utility programs to other machines. 

Shutting Lint Up 

There are occasions when the programmer is smarter than lint. There may be valid 
reasons for “illegal” type casts, functions with a variable number of arguments, etc. Moreover, as 
specified above, the flow of control information produced by lint often has blind spots, causing 
occasional spurious messages about perfectly reasonable programs. Thus, some way of 
communicating with lint, typically to shut it up, is desirable. 

The form which this mechanism should take is not at all clear. New keywords would 
require current and old compilers to recognize these keywords, if only to ignore them. This 
has both philosophical and practical problems. New preprocessor syntax suffers from similar 
problems. 

What was finally done was to cause a number of words to be recognized by lint when they 
were embedded in comments. This required minimal preprocessor changes; the preprocessor 
just had to agree to pass comments through to its output, instead of deleting them as had been 
previously done. Thus, lint directives are invisible to the compilers, and the effect on systems 
with the older preprocessors is merely that the lint directives don’t work. 

The first directive is concerned with flow of control information; if a particular place in 
the program cannot be reached, but this is not apparent to lint, this can be asserted by the 
directive 

/* NOTREACHED */ 

at the appropriate spot in the program. Similarly, if it is desired to turn off strict type check¬ 
ing for the next expression, the directive 

/* NOSTRICT */ 

can be used; the situation reverts to the previous default after the next expression. The -v 
flag can be turned on for one function by the directive 

/* ARGSUSED */ 

Complaints about variable number of arguments in calls to a function can be turned off by the 
directive 

/* VARARGS */ 


preceding the function definition. In some cases, it is desirable to check the first several argu¬ 
ments, and leave the later arguments unchecked. This can be done by following the 
VARARGS keyword immediately with a digit giving the number of arguments which should be 
checked; thus, 

/* VARARGS2 */ 

will cause the first two arguments to be checked, the others unchecked. Finally, the directive 

/* LINTLIBRARY */ 

at the head of a file identifies this file as a library declaration file; this topic is worth a section 
by itself. 
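
As an illustration, the directives described above might appear in a source file as follows. This is a sketch only: the three functions and their names are hypothetical, and the variable-argument function is written with the modern stdarg facility rather than in the style of the period.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error routine: lint cannot tell that exit() never
 * returns, so the end of the function is marked unreachable. */
int fatal(const char *msg)
{
    fprintf(stderr, "fatal: %s\n", msg);
    exit(1);
    /* NOTREACHED */
}

/* ARGSUSED */
int on_signal(int signo, int context)
{
    /* context is part of the required interface but is not needed here. */
    return signo;
}

/* VARARGS2 */
int report(int level, const char *fmt, ...)
{
    va_list ap;

    /* Only level and fmt are checked by lint; the rest are unchecked. */
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    return level;
}
```

Because the directives are ordinary comments, a compiler that knows nothing of lint accepts the file unchanged.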

Library Declaration Files 

Lint accepts certain library directives, such as 

-ly 

and tests the source files for compatibility with these libraries. This is done by accessing 
library description files whose names are constructed from the library directives. These files all 
begin with the directive 

/* LINTLIBRARY */ 

which is followed by a series of dummy function definitions. The critical parts of these 
definitions are the declaration of the function return type, whether the dummy function 
returns a value, and the number and types of arguments to the function. The VARARGS and 
ARGSUSED directives can be used to specify features of the library functions. 
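
A library description file along these lines might look like the sketch below. The library and its three routines are invented for illustration, and the definitions use prototype syntax rather than the declaration style of the period; only the return types and the number and types of the arguments matter, since the bodies are never executed.

```c
/* LINTLIBRARY */
#include <stddef.h>

/* Hypothetical library: dummy definitions recording interfaces only. */

/* ARGSUSED */
int ex_open(const char *name, int mode) { return 0; }

/* ARGSUSED */
char *ex_lookup(const char *key) { return NULL; }

/* VARARGS1 */
/* ARGSUSED */
int ex_printf(const char *fmt, ...) { return 0; }
```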

Lint library files are processed almost exactly like ordinary source files. The only 
difference is that functions which are defined on a library file, but are not used on a source file, 
draw no complaints. Lint does not simulate a full library search algorithm, and complains if 
the source files contain a redefinition of a library routine (this is a feature!). 

By default, lint checks the programs it is given against a standard library file, which con¬ 
tains descriptions of the programs which are normally loaded when a C program is run. When 
the -p flag is in effect, another file is checked containing descriptions of the standard I/O 
library routines which are expected to be portable across various machines. The -n flag can be 
used to suppress all library checking. 

Bugs, etc. 

Lint was a difficult program to write, partially because it is closely connected with 
matters of programming style, and partially because users usually don’t notice bugs which 
cause lint to miss errors which it should have caught. (By contrast, if lint incorrectly com¬ 
plains about something that is correct, the programmer reports that immediately!) 

A number of areas remain to be further developed. The checking of structures and arrays 
is rather inadequate; size incompatibilities go unchecked, and no attempt is made to match up 
structure and union declarations across files. Some stricter checking of the use of the typedef 
is clearly desirable, but what checking is appropriate, and how to carry it out, is still to be 
determined. 

Lint shares the preprocessor with the C compiler. At some point it may be appropriate 
for a special version of the preprocessor to be constructed which checks for things such as 
unused macro definitions, macro arguments which have side effects which are not expanded at 
all, or are expanded more than once, etc. 

The central problem with lint is the packaging of the information which it collects. 
There are many options which serve only to turn off, or slightly modify, certain features. 
There are pressures to add even more of these options. 

In conclusion, it appears that the general notion of having two programs is a good one. 
The compiler concentrates on quickly and accurately turning the program text into bits which 
can be run; lint concentrates on issues of portability, style, and efficiency. Lint can afford to 
be wrong, since incorrectness and over-conservatism are merely annoying, not fatal. The 
compiler can be fast since it knows that lint will cover its flanks. Finally, the programmer can 
concentrate at one stage of the programming process solely on the algorithms, data structures, and 
correctness of the program, and then later retrofit, with the aid of lint, the desirable properties 
of universality and portability. 

Appendix: Current Lint Options 

The command currently has the form 

lint [-options ] files... library-descriptors... 

The options are 
h    Perform heuristic checks 
p    Perform portability checks 
v    Don’t report unused arguments 
u    Don’t report unused or undefined externals 
b    Report unreachable break statements 
x    Report unused external declarations 
a    Report assignments of long to int or shorter 
c    Complain about questionable casts 
n    No library checking is done 
s    Same as h (for historical reasons) 

A Tutorial Introduction to ADB 

J. F. Maranzano 

S. R. Bourne 

UNIX 
Debugging 
C Programming 


ABSTRACT 

Debugging tools generally provide a wealth of information about the inner 
workings of programs. These tools have been available on UNIX† to allow 
users to examine “core” files that result from aborted programs. A new 
debugging program, ADB, provides enhanced capabilities to examine “core” and 
other program files in a variety of formats, run programs with embedded 
breakpoints and patch files. 

ADB is an indispensable but complex tool for debugging crashed systems 
and/or programs. This document provides an introduction to ADB with exam¬ 
ples of its use. It explains the various formatting options, techniques for 
debugging C programs, examples of printing file system information and patch¬ 
ing. 


May 5, 1977 


† UNIX is a trademark of Bell Laboratories. 

Bell Laboratories 
Murray Hill, New Jersey 07974 


1. Introduction 


ADB is a new debugging program that is available on UNIX. It provides capabilities to 
look at “core” files resulting from aborted programs, print output in a variety of formats, patch 
files, and run programs with embedded breakpoints. This document provides examples of the 
more useful features of ADB. The reader is expected to be familiar with the basic commands 
on UNIX, with the C language, and with References 1, 2 and 3. 


2. A Quick Survey 


2.1. Invocation 

ADB is invoked as: 

adb objfile corefile 

where objfile is an executable UNIX file and corefile is a core image file. Many times this will 
look like: 


adb a.out core 


or more simply: 

adb 

where the defaults are a.out and core respectively. The filename minus (-) means ignore this 
argument as in: 

adb - core 


ADB has requests for examining locations in either file. The ? request examines the 
contents of objfile; the / request examines the corefile. The general form of these requests is: 

address ? format 

or 


address / format 


2.2. Current Address 

ADB maintains a current address, called dot, similar in function to the current pointer in 
the UNIX editor. When an address is entered, the current address is set to that location, so 
that: 

0126?i 

sets dot to octal 126 and prints the instruction at that address. The request: 

.,10/d 


prints 10 decimal numbers starting at dot. Dot ends up referring to the address of the last 
item printed. When used with the ? or / requests, the current address can be advanced by 
typing newline; it can be decremented by typing ^. 

Addresses are represented by expressions. Expressions are made up from decimal, octal, 
and hexadecimal integers, and symbols from the program under test. These may be combined 
with the operators +, -, *, % (integer division), & (bitwise and), | (bitwise inclusive or), # 
(round up to the next multiple), and ~ (not). (All arithmetic within ADB is 32 bits.) When 
typing a symbolic address for a C program, the user can type name or _name; ADB will recognize 
both forms. 

2.3. Formats 

To print data, a user specifies a collection of letters and characters that describe the for¬ 
mat of the printout. Formats are "remembered" in the sense that typing a request without one 
will cause the new printout to appear in the previous format. The following are the most com¬ 
monly used format letters. 

b    one byte in octal 
c    one byte as a character 
o    one word in octal 
d    one word in decimal 
f    two words in floating point 
i    PDP-11 instruction 
s    a null terminated character string 
a    the value of dot 
u    one word as unsigned integer 
n    print a newline 
r    print a blank space 
^    backup dot 

(Format letters are also available for "long" values, for example, ‘D’ for long decimal, and ‘F’ 
for double floating point.) For other formats see the ADB manual. 

2.4. General Request Meanings 

The general form of a request is: 

address,count command modifier 

which sets ‘dot’ to address and executes the command count times. 

The following table illustrates some general ADB command meanings: 

Command   Meaning 

?         Print contents from a.out file 
/         Print contents from core file 
=         Print value of "dot" 
:         Breakpoint control 
$         Miscellaneous requests 
;         Request separator 
!         Escape to shell 

ADB catches signals, so a user cannot use a quit signal to exit from ADB. The request 
$q or $Q (or cntl-D) must be used to exit from ADB. 

3. Debugging C Programs 

3.1. Debugging A Core Image 

Consider the C program in Figure 1. The program is used to illustrate a common error 
made by C programmers. The object of the program is to change the lower case "t" to upper 
case in the string pointed to by charp and then write the character string to the file indicated 
by argument 1. The bug shown is that the character "T" is stored in the pointer charp instead 
of the string pointed to by charp. Executing the program produces a core file because of an 
out of bounds memory reference. 
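
The kind of error described can be sketched as follows. This is a hypothetical reconstruction for illustration, not the program of Figure 1; the function name is invented, and it shows the intended assignment, with the faulty form in a comment.

```c
#include <string.h>

/* Change the first lower case "t" in the string to upper case. */
char *upcase_first_t(char *charp)
{
    char *p = strchr(charp, 't');

    /* The faulty form assigns the character to the pointer itself:
     *     charp = 'T';
     * which leaves the string untouched and aims the pointer at
     * address 0124 (the code for "T"). The intended form stores
     * through the pointer instead: */
    if (p != NULL)
        *p = 'T';
    return charp;
}
```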

ADB is invoked by: 

adb a.out core 

The first debugging request: 

$c 

is used to give a C backtrace through the subroutines called. As shown in Figure 2 only one 
function (main) was called and the arguments argc and argv have octal values 02 and 0177762 
respectively. Both of these values look reasonable; 02 = two arguments, 0177762 = address 
on stack of parameter vector. 

The next request: 

$C 

is used to give a C backtrace plus an interpretation of all the local variables in each function 
and their values in octal. The value of the variable cc looks incorrect since cc was declared as a 
character. 

The next request: 

$r 

prints out the registers including the program counter and an interpretation of the instruction 
at that location. 

The request: 

$e 

prints out the values of all external variables. 

A map exists for each file handled by ADB. The map for the a.out file is referenced by ? 
whereas the map for core file is referenced by /. Furthermore, a good rule of thumb is to use ? 
for instructions and / for data when looking at programs. To print out information about the 
maps type: 

$m 

This produces a report of the contents of the maps. More about these maps later. 

In our example, it is useful to see the contents of the string pointed to by charp. This is 
done by: 

*charp/s 

which says use charp as a pointer in the core file and print the information as a character 
string. This printout clearly shows that the character buffer was incorrectly overwritten and 
helps identify the error. Printing the locations around charp shows that the buffer is 
unchanged but that the pointer is destroyed. Using ADB similarly, we could print information 
about the arguments to a function. The request: 

main.argc/d 

prints the decimal core image value of the argument argc in the function main. 

The request: 

*main.argv,3/o 

prints the octal values of the three consecutive cells pointed to by argv in the function main. 
Note that these values are the addresses of the arguments to main. Therefore: 

0177770/s 

prints the ASCII value of the first argument. Another way to print this value would have been: 

*"/s 

The " means ditto which remembers the last address typed, in this case main.argc; the * 
instructs ADB to use the address field of the core file as a pointer. 

The request: 

.=o 

prints the current address (not its contents) in octal which has been set to the address of the 
first argument. The current address, dot, is used by ADB to "remember" its current location. 
It allows the user to reference locations relative to the current address, for example: 

.-10/d 

3.2. Multiple Functions 

Consider the C program illustrated in Figure 3. This program calls functions f, g, and h 
until the stack is exhausted and a core image is produced. 

Again you can enter the debugger via: 

adb 

which assumes the names a.out and core for the executable file and core image file respectively. 
The request: 

$c 

will fill a page of backtrace references to f, g, and h. Figure 4 shows an abbreviated list (typing 
DEL will terminate the output and bring you back to ADB request level). 

The request: 

,5$C 

prints the five most recent activations. 

Notice that each function (f, g, h) has a counter of the number of times it was called. 

The request: 

fcnt/d 

prints the decimal value of the counter for the function f. Similarly gcnt and hcnt could be 
printed. To print the value of an automatic variable, for example the decimal value of x in the 
last call of the function h, type: 

h.x/d 

It is currently not possible in the exported version to print stack frames other than the most 
recent activation of a function. Therefore, a user can print everything with $C or the 
occurrence of a variable in the most recent call of a function. It is possible with the $C 
request, however, to print the stack frame starting at some address as address$C. 






3.3. Setting Breakpoints 

Consider the C program in Figure 5. This program, which changes tabs into blanks, is 
adapted from Software Tools by Kernighan and Plauger, pp. 18-27. 

We will run this program under the control of ADB (see Figure 6a) by: 

adb a.out - 

Breakpoints are set in the program as: 

address:b [request] 

The requests: 

settab+4:b 

fopen+4:b 

getc+4:b 

tabpos+4:b 

set breakpoints at the start of these functions. C does not generate statement labels. 
Therefore it is currently not possible to plant breakpoints at locations other than function entry 
points without a knowledge of the code generated by the C compiler. The above addresses are 
entered as symbol+4 so that they will appear in any C backtrace since the first instruction of 
each function is a call to the C save routine (csv). Note that some of the functions are from 
the C library. 

To print the location of breakpoints one types: 

$b 

The display indicates a count field. A breakpoint is bypassed count -1 times before causing a 
stop. The command field indicates the ADB requests to be executed each time the breakpoint 
is encountered. In our example no command fields are present. 

By displaying the original instructions at the function settab we see that the breakpoint is 
set after the jsr to the C save routine. We can display the instructions using the ADB request: 

settab,5?ia 

This request displays five instructions starting at settab with the addresses of each location 
displayed. Another variation is: 

settab,5?i 

which displays the instructions with only the starting address. 

Notice that we accessed the addresses from the a.out file with the ? command. In general 
when asking for a printout of multiple items, ADB will advance the current address the 
number of bytes necessary to satisfy the request; in the above example five instructions were 
displayed and the current address was advanced 18 (decimal) bytes. 

To run the program one simply types: 


:r 

To delete a breakpoint, for instance the entry to the function settab , one types: 

settab+4:d 

To continue execution of the program from the breakpoint type: 

:c 

Once the program has stopped (in this case at the breakpoint for fopen), ADB requests 
can be used to display the contents of memory. For example: 

$C 

to display a stack trace, or: 

tabs,3/8o 

to print three lines of 8 locations each from the array called tabs. By this time (at location 
fopen) in the C program, settab has been called and should have set a one in every eighth loca¬ 
tion of tabs. 
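
What settab is expected to have done by this point can be sketched as below. The real routine is the one in Figure 5; this version, its signature, and the assumption that the first tab stop falls at location zero are illustrative only.

```c
#define TABSP 8   /* assumed tab spacing */

/* Set a one in every TABSP-th location of tabs and a zero elsewhere. */
void settab(int tabs[], int n)
{
    int i;

    for (i = 0; i < n; i++)
        tabs[i] = (i % TABSP == 0) ? 1 : 0;
}
```

Under these assumptions each of the three lines printed by tabs,3/8o would begin with a one followed by seven zeros.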

3.4. Advanced Breakpoint Usage 

We continue execution of the program with: 

:c 

See Figure 6b. Getc is called three times and the contents of the variable c in the function 
main are displayed each time. The single character on the left hand edge is the output from 
the C program. On the third occurrence of getc the program stops. We can look at the full 
buffer of characters by typing: 

ibuf+6/20c 

When we continue the program with: 


:c 

we hit our first breakpoint at tabpos since there is a tab following the "This" word of the data. 

Several breakpoints of tabpos will occur until the program has changed the tab into 
equivalent blanks. Since we feel that tabpos is working, we can remove the breakpoint at that 
location by: 

tabpos+4:d 

If the program is continued with: 


:c 

it resumes normal execution after ADB prints the message 

a.out:running 

The UNIX quit and interrupt signals act on ADB itself rather than on the program being 
debugged. If such a signal occurs then the program being debugged is stopped and control is 
returned to ADB. The signal is saved by ADB and is passed on to the test program if: 

:c 

is typed. This can be useful when testing interrupt handling routines. The signal is not passed 
on to the test program if: 

:c 0 

is typed. 

Now let us reset the breakpoint at settab and display the instructions located there when 
we reach the breakpoint. This is accomplished by: 

settab+4:b settab,5?ia * 

It is also possible to execute the ADB requests for each occurrence of the breakpoint but only 
stop after the third occurrence by typing: 

getc+4,3:b main.c?C * 

This request will print the local variable c in the function main at each occurrence of the 
breakpoint. The semicolon is used to separate multiple ADB requests on a single line. 

* Owing to a bug in early versions of ADB (including the version distributed in Generic 3 UNIX) these 
statements must be written as: 

settab+4:b settab,5?ia;0 

getc+4,3:b main.c?C;0 

settab+4:b settab,5?ia; ptab/o;0 

Note that ;0 will set dot to zero and stop at the breakpoint. 

Warning: setting a breakpoint causes the value of dot to be changed; executing the pro¬ 
gram under ADB does not change dot. Therefore: 

settab+4:b .,5?ia 
fopen+4:b 

will print the last thing dot was set to (in the example fopen+4) not the current location 
(settab+4) at which the program is executing. 

A breakpoint can be overwritten without first deleting the old breakpoint. For example: 

settab+4:b settab,5?ia; ptab/o * 

could be entered after typing the above requests. 

Now the display of breakpoints: 

$b 

shows the above request for the settab breakpoint. When the breakpoint at settab is 
encountered the ADB requests are executed. Note that the location at settab+4 has been changed to 
plant the breakpoint; all the other locations match their original value. 

Using the functions f, g and h shown in Figure 3, we can follow the execution of each 
function by planting non-stopping breakpoints. We call ADB with the executable program of 
Figure 3 as follows: 

adb ex3 - 

Suppose we enter the following breakpoints: 

h+4:b hcnt/d; h.hi/; h.hr/ 

g+4:b gcnt/d; g.gi/; g.gr/ 

f+4:b fcnt/d; f.fi/; f.fr/ 

:r 

Each request line indicates that the variables are printed in decimal (by the specification d). 
Since the format is not changed, the d can be left off all but the first request. 

The output in Figure 7 illustrates two points. First, the ADB requests in the breakpoint 
line are not examined until the program under test is run. That means any errors in those 
ADB requests are not detected until run time. At the location of the error ADB stops running 
the program. 

The second point is the way ADB handles register variables. ADB uses the symbol table 
to address variables. Register variables, like f.fr above, have pointers to uninitialized places on 
the stack; hence the message "symbol not found". 

Another way of getting at the data in this example is to print the variables used in the 
call as: 

f+4:b fcnt/d; f.a/; f.b/; f.fi/ 

g+4:b gcnt/d; g.p/; g.q/; g.gi/ 

:c 

The operator / was used instead of ? to read values from the core file. The output for each 
function, as shown in Figure 7, has the same format. For the function f, for example, it shows 
the name and value of the external variable fcnt. It also shows the address on the stack and 
value of the variables a, b and fi. 












Notice that the addresses on the stack will continue to decrease until no address space is 
left for program execution at which time (after many pages of output) the program under test 
aborts. A display with names would be produced by requests like the following: 

f+4:b fcnt/d; f.a/"a="d; f.b/"b="d; f.fi/"fi="d 

In this format the quoted string is printed literally and the d produces a decimal display of the 
variables. The results are shown in Figure 7. 

3.5. Other Breakpoint Facilities 

• Arguments and change of standard input and output are passed to a program as: 

:r arg1 arg2 ... <infile >outfile 

This request kills any existing program under test and starts the a.out afresh. 

• The program being debugged can be single stepped by: 

:s 

If necessary, this request will start up the program being debugged and stop after execut¬ 
ing the first instruction. 

• ADB allows a program to be entered at a specific address by typing: 

address:r 

• The count field can be used to skip the first n breakpoints as: 

,n:r 

The request: 

,n:c 

may also be used for skipping the first n breakpoints when continuing a program. 

• A program can be continued at an address different from the breakpoint by: 

address:c 

• The program being debugged runs as a separate process and can be killed by: 

:k 

4. Maps 

UNIX supports several executable file formats. These are used to tell the loader how to 
load the program file. File type 407 is the most common and is generated by a C compiler 
invocation such as cc pgm.c. A 410 file is produced by a C compiler command of the form cc 
-n pgm.c, whereas a 411 file is produced by cc -i pgm.c. ADB interprets these different file 
formats and provides access to the different segments through a set of maps (see Figure 8). To 
print the maps type: 

$m 

In 407 files, both text (instructions) and data are intermixed. This makes it impossible 
for ADB to differentiate data from instructions and some of the printed symbolic addresses 
look incorrect; for example, printing data addresses as offsets from routines. 

In 410 files (shared text), the instructions are separated from data and ?* accesses the 
data part of the a.out file. The ?* request tells ADB to use the second part of the map in the 
a.out file. Accessing data in the core file shows the data after it was modified by the execution 
of the program. Notice also that the data segment may have grown during program execution. 

In 411 files (separated I & D space), the instructions and data are also separated. How¬ 
ever, in this case, since data is mapped through a separate set of segmentation registers, the 
base of the data segment is also relative to address zero. In this case since the addresses over¬ 
lap it is necessary to use the ?* operator to access the data space of the a.out file. In both 410 
and 411 files the corresponding core file does not contain the program text. 

Figure 9 shows the display of three maps for the same program linked as a 407, 410, 411 
respectively. The b, e, and f fields are used by ADB to map addresses into file addresses. The 
"f1" field is the length of the header at the beginning of the file (020 bytes for an a.out file and 
02000 bytes for a core file). The "f2" field is the displacement from the beginning of the file to 
the data. For a 407 file with mixed text and data this is the same as the length of the header; 
for 410 and 411 files this is the length of the header plus the size of the text portion. 

The "b" and "e" fields are the starting and ending locations for a segment. Given an 
address, A, the location in the file (either a.out or core) is calculated as: 

b1 ≤ A ≤ e1  ⇒  file address = (A-b1)+f1 
b2 ≤ A ≤ e2  ⇒  file address = (A-b2)+f2 
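
The calculation above can be expressed as a small function. This is a sketch under stated assumptions, not ADB's own code: the structure, its field names (taken from the b, e and f fields of the text) and the function name are hypothetical, and the bounds are taken as inclusive, following the formula as printed.

```c
#include <stdbool.h>

/* One map: two (b, e, f) triples, named as in the text. */
struct adb_map {
    unsigned long b1, e1, f1;   /* first segment: start, end, file offset  */
    unsigned long b2, e2, f2;   /* second segment: start, end, file offset */
};

/* Translate address a to a file address. Returns false, leaving *off
 * untouched, if a lies in neither segment. */
bool map_address(const struct adb_map *m, unsigned long a, unsigned long *off)
{
    if (m->b1 <= a && a <= m->e1) {
        *off = (a - m->b1) + m->f1;
        return true;
    }
    if (m->b2 <= a && a <= m->e2) {
        *off = (a - m->b2) + m->f2;
        return true;
    }
    return false;
}
```

With b1 = 0 and f1 = 020, for instance, address 0126 maps to file address 0146.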

A user can access locations by using the ADB defined variables. The $v request prints the 
variables initialized by ADB: 

b base address of data segment 

d length of the data segment 

s length of the stack 

t length of the text 

m execution type (407,410,411) 

In Figure 9 those variables not present are zero. Use can be made of these variables by 
expressions such as: 

<b 

in the address field. Similarly the value of the variable can be changed by an assignment 
request such as: 

02000>b 

that sets b to octal 2000. These variables are useful to know if the file under examination is an 
executable or core image file. 

ADB reads the header of the core image file to find the values for these variables. If the 
second file specified does not seem to be a core file, or if it is missing then the header of the 
executable file is used instead. 

5. Advanced Usage 

It is possible with ADB to combine formatting requests to provide elaborate displays. 
Below are several examples. 

5.1. Formatted dump 

The line: 

<b,-1/4o4^8Cn 

prints 4 octal words followed by their ASCII interpretation from the data space of the core 
image file. Broken down, the various request pieces mean: 

<b        The base address of the data segment. 

<b,-1     Print from the base address to the end of file. A negative count is 
          used here and elsewhere to loop indefinitely or until some error 
          condition (like end of file) is detected. 

The format 4o4^8Cn is broken down as follows: 

4o        Print 4 octal locations. 

4^        Backup the current address 4 locations (to the original start of the 
          field). 

8C        Print 8 consecutive characters using an escape convention; each 
          character in the range 0 to 037 is printed as @ followed by the 
          corresponding character in the range 0140 to 0177. An @ is printed 
          as @@. 

n         Print a newline. 

The request: 

<b,<d/4o4^8Cn 

could have been used instead to allow the printing to stop at the end of the data segment (<d 
provides the data segment size in bytes). 
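
The escape convention used by the C format can be sketched as below. The function name is hypothetical, and the treatment of bytes above 0177, which the description does not cover, is an assumption: they are passed through unchanged.

```c
#include <stddef.h>

/* Write the escaped form of byte c into out, which must have room for
 * two characters and a terminating NUL. Returns the number of
 * characters produced, not counting the NUL. */
size_t escape_byte(unsigned char c, char out[3])
{
    size_t n = 0;

    if (c <= 037) {
        out[n++] = '@';
        out[n++] = (char)(c + 0140);   /* 0..037 maps onto 0140..0177 */
    } else if (c == '@') {
        out[n++] = '@';
        out[n++] = '@';
    } else {
        out[n++] = (char)c;            /* assumed: everything else as is */
    }
    out[n] = '\0';
    return n;
}
```

A tab (011), for example, comes out as @i, since 011 + 0140 is 0151, the code for "i".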

The formatting requests can be combined with ADB’s ability to read in a script to 
produce a core image dump script. ADB is invoked as: 

adb a.out core < dump 

to read in a script file, dump , of requests. An example of such a script is: 

120$w 
4095$s 
$v 
=3n 
$m 
=3n"C Stack Backtrace" 
$C 
=3n"C External Variables" 
$e 
=3n"Registers" 
$r 
0$s 
=3n"Data Segment" 
<b,-1/8ona 

The request 120$w sets the width of the output to 120 characters (normally, the width 
is 80 characters). ADB attempts to print addresses as: 

symbol + offset 

The request 4095$s increases the maximum permissible offset to the nearest symbolic address 
from 255 (default) to 4095. The request = can be used to print literal strings. Thus, headings 
are provided in this dump program with requests of the form: 

=3n"C Stack Backtrace" 

that spaces three lines and prints the literal string. The request $v prints all non-zero ADB 
variables (see Figure 8). The request 0$s sets the maximum offset for symbol matches to zero 
thus suppressing the printing of symbolic labels in favor of octal values. Note that this is only 
done for the printing of the data segment. The request: 

<b,-1/8ona 

prints a dump from the base of the data segment to the end of file with an octal address field 
and eight octal numbers per line. 

Figure 11 shows the results of some formatting requests on the C program of Figure 10. 

5.2. Directory Dump 

As another illustration (Figure 12) consider a set of requests to dump the contents of a 
directory (which is made up of an integer inumber followed by a 14 character name): 

adb dir - 

=n8t"Inum"8t"Name" 

0,-1?u8t14cn 

In this example, the u prints the inumber as an unsigned decimal integer, the 8t means that 
ADB will space to the next multiple of 8 on the output line, and the 14c prints the 14 charac¬ 
ter file name. 
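
The record layout that this request walks through can be written down as a C structure. The sketch below is illustrative: the structure and function names are hypothetical, the inumber is assumed to be a 16-bit integer stored in the byte order of the machine running the code, and the output is only an approximation of what ADB prints.

```c
#include <stdint.h>
#include <stdio.h>

#define DIRSIZ 14

/* One directory entry: an integer inumber and a 14 character name. */
struct old_dirent {
    uint16_t inum;
    char     name[DIRSIZ];   /* padded with NULs, not always terminated */
};

/* Format one entry roughly as "u8t14cn" would: the inumber as an
 * unsigned decimal integer, a tab, the name, and a newline. Unlike
 * the 14c format, this stops at the first NUL in the name. */
int format_dirent(const struct old_dirent *d, char *buf, size_t size)
{
    return snprintf(buf, size, "%u\t%.*s\n",
                    (unsigned)d->inum, DIRSIZ, d->name);
}
```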

5.3. Ilist Dump 

Similarly the contents of the ilist of a file system (e.g. /dev/src, on UNIX systems 
distributed by the UNIX Support Group; see UNIX Programmer’s Manual Section V) could be 
dumped with the following set of requests: 

adb /dev/src - 
02000>b 
?m <b 

<b,-1?"flags"8ton"links,uid,gid"8t3bn"size"8tbrdn"addr"8t8un"times"8t2Y2na 

In this example the value of the base for the map was changed to 02000 (by saying ?m<b) 
since that is the start of an ilist within a file system. An artifice (brd above) was used to print 
the 24 bit size field as a byte, a space, and a decimal integer. The last access time and last 
modify time are printed with the 2Y operator. Figure 12 shows portions of these requests as 
applied to a directory and file system. 

5.4. Converting values 

ADB may be used to convert values from one representation to another. For example: 

072 = odx 

will print 

072 58 #3a 

which are the octal, decimal and hexadecimal representations of 072 (octal). The format is 
remembered so that typing subsequent numbers will print them in the given formats. 
Character values may be converted similarly, for example: 

'a' = co 


prints 

a 0141 

It may also be used to evaluate expressions but be warned that all binary operators have the 
same precedence which is lower than that for unary operators. 
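
For comparison, the same conversions can be written in C. This is a small sketch; the function names format_odx and format_co are hypothetical and simply imitate the output layout of the two requests shown above.

```c
#include <stdio.h>
#include <stddef.h>

/* Format v the way "v = odx" does: octal, decimal, then hex with '#'. */
int format_odx(unsigned v, char *buf, size_t cap)
{
    return snprintf(buf, cap, "%#o %d #%x", v, (int)v, v);
}

/* Format a character the way "'c' = co" does: the character, then octal. */
int format_co(unsigned char c, char *buf, size_t cap)
{
    return snprintf(buf, cap, "%c %#o", c, (unsigned)c);
}
```

Here format_odx(072, ...) yields the string 072 58 #3a, and format_co('a', ...) yields a 0141.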
6. Patching 

Patching files with ADB is accomplished with the write, w or W, request (which is not 
like the ed editor write command). This is often used in conjunction with the locate, l or L, 
request. In general, the request syntax for l and w is similar, as follows: 

?l value 

The request l is used to match on two bytes; L is used for four bytes. The request w is used 
to write two bytes, whereas W writes four bytes. The value field in either locate or write 
requests is an expression. Therefore, decimal and octal numbers, or character strings, are 
supported. 

In order to modify a file, ADB must be called as: 

adb -w file1 file2 

When called with this option, file1 and file2 are created if necessary and opened for both 
reading and writing. 

For example, consider the C program shown in Figure 10. We can change the word 
"This" to "The " in the executable file for this program, ex7, by using the following requests: 

adb -w ex7 - 
?l 'Th' 

?W 'The ' 

The request ?l starts at dot and stops at the first match of "Th", having set dot to the address 
of the location found. Note the use of ? to write to the a.out file. The form ?* would have 
been used for a 411 file. 
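
The locate-then-write sequence can be pictured as a search and overwrite on a byte buffer. The following C sketch is only an analogy on an in-memory copy of the data: patch_first is a hypothetical name, and the sketch does not model ADB's maps or how the l request steps through the file.

```c
#include <string.h>
#include <stddef.h>

/*
 * Search buf for the first occurrence of find (flen bytes) and overwrite
 * it in place with repl (rlen bytes). Returns the offset of the match,
 * or -1 if nothing was found or the replacement would not fit.
 */
long patch_first(unsigned char *buf, size_t len,
                 const void *find, size_t flen,
                 const void *repl, size_t rlen)
{
    size_t i;

    if (flen == 0 || flen > len)
        return -1;
    for (i = 0; i + flen <= len; i++) {
        if (memcmp(buf + i, find, flen) == 0) {
            if (i + rlen > len)
                return -1;
            memcpy(buf + i, repl, rlen);   /* the "write" step */
            return (long)i;                /* plays the role of dot */
        }
    }
    return -1;
}
```

Searching for the two bytes "Th" and writing the four bytes "The " over the match turns "This is a character string" into "The  is a character string", which is the change made to ex7 above.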

More frequently the request will be typed as: 

?l 'Th'; ?s 

which locates the first occurrence of "Th" and prints the entire string. Execution of this ADB 
request will set dot to the address of the "Th" characters. 

As another example of the utility of the patching facility, consider a C program that has 
an internal logic flag. The flag could be set by the user through ADB and the program run. 
For example: 

adb a.out - 
:s arg1 arg2 
flag/w 1 

:c 

The :s request is normally used to single step through a process or start a process in single 
step mode. In this case it starts a.out as a subprocess with arguments arg1 and arg2. If 
there is a subprocess running, ADB writes to it rather than to the file, so the w request causes 
flag to be changed in the memory of the subprocess. 

7. Anomalies 

Below is a list of some strange things that users should be aware of. 

1. Function calls and arguments are put on the stack by the C save routine. Putting break¬ 
points at the entry point to routines means that the function appears not to have been 
called when the breakpoint occurs. 

2. When printing addresses, ADB uses either text or data symbols from the a.out file. This 
sometimes causes unexpected symbol names to be printed with data (e.g. savr5+022). 
This does not happen if ? is used for text (instructions) and / for data. 

3. ADB cannot handle C register variables in the most recently activated function. 
Figure 1: C program with pointer bug 

struct buf { 
        int fildes; 
        int nleft; 
        char *nextp; 
        char buff[512]; 
}bb; 
struct buf *obuf; 

char *charp "this is a sentence."; 

main(argc,argv) 
int argc; 
char **argv; 
{ 
        char cc; 

        if(argc < 2) { 
                printf("Input file missing\n"); 
                exit(8); 
        } 

        if((fcreat(argv[1],obuf)) < 0){ 
                printf("%s : not found\n", argv[1]); 
                exit(8); 
        } 
        charp = 'T'; 
        printf("debug 1 %s\n",charp); 
        while(cc= *charp++) 
                putc(cc,obuf); 
        fflush(obuf); 
} 





Figure 2: ADB output for C program of Figure 1 

adb a.out core 
$c 
~main(02,0177762) 
$C 
~main(02,0177762) 
        argc:     02 
        argv:     0177762 
        cc:       02124 
$r 
ps      0170010 
pc      0204    ~main+0152 
sp      0177740 
r5      0177752 
r4      01 
r3      0 
r2      0 
r1      0 
r0      0124 
~main+0152:     mov     _obuf,(sp) 
$e 
savr5:     0 
_obuf:     0 
_charp:    0124 
_errno:    0 
_fout:     0 
$m 
text map    'ex1' 
b1 = 0          e1 = 02360      f1 = 020 
b2 = 0          e2 = 02360      f2 = 020 
data map    'core1' 
b1 = 0          e1 = 03500      f1 = 02000 
b2 = 0175400    e2 = 0200000    f2 = 05500 
*charp/s 
0124:           TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^ 
                Nh@x& 
charp/s 
_charp:         T 
_charp+02:      this is a sentence. 
_charp+026:     Input file missing 
main.argc/d 
0177756:        2 
*main.argv/3o 
0177762:        0177770 0177776 0177777 
0177770/s 
0177770:        a.out 
*main.argv/3o 
0177762:        0177770 0177776 0177777 
*"/s 
0177770:        a.out 
.=o 
                0177770 
.-10/d 
0177756:        2 
$q 
Figure 3: Multiple function C program for stack trace illustration 

int     fcnt,gcnt,hcnt; 
h(x,y) 
{ 
        int hi; register int hr; 
        hi = x+1; 
        hr = x-y+1; 
        hcnt++ ; 
        hj: 
        f(hr,hi); 
} 

g(p,q) 
{ 
        int gi; register int gr; 
        gi = q-p; 
        gr = q-p+1; 
        gcnt++ ; 
        gj: 
        h(gr,gi); 
} 

f(a,b) 
{ 
        int fi; register int fr; 
        fi = a+2*b; 
        fr = a+b; 
        fcnt++ ; 
        fj: 
        g(fr,fi); 
} 

main() 
{ 
        f(1,1); 
} 
Figure 4: ADB output for C program of Figure 3 

adb 
$c 
~h(04452,04451) 
~g(04453,011124) 
~f(02,04451) 
~h(04450,04447) 
~g(04451,011120) 
~f(02,04447) 
~h(04446,04445) 
~g(04447,011114) 
~f(02,04445) 
~h(04444,04443) 
HIT DEL KEY 
adb 
,5$C 
~h(04452,04451) 
        x:      04452 
        y:      04451 
        hi:     ? 
~g(04453,011124) 
        p:      04453 
        q:      011124 
        gi:     04451 
        gr:     ? 
~f(02,04451) 
        a:      02 
        b:      04451 
        fi:     011124 
        fr:     04453 
~h(04450,04447) 
        x:      04450 
        y:      04447 
        hi:     04451 
        hr:     02 
~g(04451,011120) 
        p:      04451 
        q:      011120 
        gi:     04447 
        gr:     04450 
fcnt/d 
_fcnt:          1173 
gcnt/d 
_gcnt:          1173 
hcnt/d 
_hcnt:          1172 
h.x/d 
022004:         2346 
$q 
Figure 5: C program to decode tabs 

#define MAXLINE 80 
#define YES     1 
#define NO      0 
#define TABSP   8 

char    input[] "data"; 
char    ibuf[518]; 
int     tabs[MAXLINE]; 

main() 
{ 
        int col, *ptab; 
        char c; 

        ptab = tabs; 
        settab(ptab);   /*Set initial tab stops */ 
        col = 1; 
        if(fopen(input,ibuf) < 0) { 
                printf("%s : not found\n",input); 
                exit(8); 
        } 
        while((c = getc(ibuf)) != -1) { 
                switch(c) { 
                        case '\t':      /* TAB */ 
                                while(tabpos(col) != YES) { 
                                        putchar(' ');   /* put BLANK */ 
                                        col++ ; 
                                } 
                                break; 
                        case '\n':      /*NEWLINE */ 
                                putchar('\n'); 
                                col = 1; 
                                break; 
                        default: 
                                putchar(c); 
                                col++ ; 
                } 
        } 
} 

/* Tabpos return YES if col is a tab stop */ 
tabpos(col) 
int col; 
{ 
        if(col > MAXLINE) 
                return(YES); 
        else 
                return(tabs[col]); 
} 

/* Settab - Set initial tab stops */ 
settab(tabp) 
int *tabp; 
{ 
        int i; 

        for(i = 0; i<= MAXLINE; i++) 
                (i%TABSP) ? (tabs[i] = NO) : (tabs[i] = YES); 
} 
Figure 6a: ADB output for C program of Figure 5 

adb a.out - 
settab+4:b 
fopen+4:b 
getc+4:b 
tabpos+4:b 
$b 
breakpoints 
count   bkpt            command 
1       ~tabpos+04 
1       _getc+04 
1       _fopen+04 
1       ~settab+04 
settab,5?ia 
~settab:        jsr     r5,csv 
~settab+04:     tst     -(sp) 
~settab+06:     clr     0177770(r5) 
~settab+012:    cmp     $0120,0177770(r5) 
~settab+020:    blt     ~settab+076 
~settab+022: 
settab,5?i 
~settab:        jsr     r5,csv 
                tst     -(sp) 
                clr     0177770(r5) 
                cmp     $0120,0177770(r5) 
                blt     ~settab+076 
:r 
a.out: running 
breakpoint      ~settab+04:     tst     -(sp) 
settab+4:d 
:c 
a.out: running 
breakpoint      _fopen+04:      mov     04(r5),nulstr+012 
$C 
_fopen(02302,02472) 
~main(01,0177770) 
        col:      01 
        c:        0 
        ptab:     03500 
tabs,3/8o 
03500:          01      0       0       0       0       0       0       0 
                01      0       0       0       0       0       0       0 
                01      0       0       0       0       0       0       0 
Figure 6b: ADB output for C program of Figure 5 

:c 
a.out: running 
breakpoint      _getc+04:       mov     04(r5),r1 
ibuf+6/20c 
_cleanu+0202:           This    is      a test  of 
:c 
a.out: running 
breakpoint      ~tabpos+04:     cmp     $0120,04(r5) 
tabpos+4:d 
settab+4:b settab,5?ia 
settab+4:b settab,5?ia; 0 
getc+4,3:b main.c?C; 0 
settab+4:b settab,5?ia; ptab/o; 0 
$b 
breakpoints 
count   bkpt            command 
1       ~tabpos+04 
3       _getc+04        main.c?C;0 
1       _fopen+04 
1       ~settab+04      settab,5?ia;ptab?o;0 
~settab:        jsr     r5,csv 
~settab+04:     bpt 
~settab+06:     clr     0177770(r5) 
~settab+012:    cmp     $0120,0177770(r5) 
~settab+020:    blt     ~settab+076 
~settab+022: 
0177766:        0177770 
0177744: 
T0177744:       T 
h0177744:       h 
i0177744:       i 
s0177744:       s 
Figure 7: ADB output for C program with breakpoints 

adb ex3 - 
h+4:b hcnt/d; h.hi/; h.hr/ 
g+4:b gcnt/d; g.gi/; g.gr/ 
f+4:b fcnt/d; f.fi/; f.fr/ 
:r 
ex3: running 
_fcnt:          0 
0177732:        214 
symbol not found 
f+4:b fcnt/d; f.a/; f.b/; f.fi/ 
g+4:b gcnt/d; g.p/; g.q/; g.gi/ 
h+4:b hcnt/d; h.x/; h.y/; h.hi/ 
:c 
ex3: running 
_fcnt:          0 
0177746:        1 
0177750:        1 
0177732:        214 
_gcnt:          0 
0177726:        2 
0177730:        3 
0177712:        214 
_hcnt:          0 
0177706:        2 
0177710:        1 
0177672:        214 
_fcnt:          1 
0177666:        2 
0177670:        3 
0177652:        214 
_gcnt:          1 
0177646:        5 
0177650:        8 
0177632:        214 
HIT DEL 
f+4:b fcnt/d; f.a/"a = "d; f.b/"b = "d; f.fi/"fi = "d 
g+4:b gcnt/d; g.p/"p = "d; g.q/"q = "d; g.gi/"gi = "d 
h+4:b hcnt/d; h.x/"x = "d; h.y/"h = "d; h.hi/"hi = "d 
:r 
ex3: running 
_fcnt:          0 
0177746:        a = 1 
0177750:        b = 1 
0177732:        fi = 214 
_gcnt:          0 
0177726:        p = 2 
0177730:        q = 3 
0177712:        gi = 214 
_hcnt:          0 
0177706:        x = 2 
0177710:        y = 1 
0177672:        hi = 214 
_fcnt:          1 
0177666:        a = 2 
0177670:        b = 3 
0177652:        fi = 214 
HIT DEL 
$q 
Figure 8: ADB address maps 

407 files 

a.out     hdr     text+data 
        |_______|_______________| 
                0               D 

core      hdr     text+data               stack 
        |_______|_______________|.......|_______| 
                0               D       S       E 

410 files (shared text) 

a.out     hdr     text            data 
        |_______|_______________|_______| 
                0       T       B       D 

core      hdr     data                    stack 
        |_______|_______________|.......|_______| 
                B               D       S       E 

411 files (separated I and D space) 

a.out     hdr     text            data 
        |_______|_______________|_______| 
                0       T       0       D 

core      hdr     data                    stack 
        |_______|_______________|.......|_______| 
                0               D       S       E 

The following adb variables are set. 

                                407     410     411 
b       base of data            0       B       0 
d       length of data          D       D-B     D 
s       length of stack         S       S       S 
t       length of text          0       T       T 
Figure 9: ADB output for maps 

adb map407 core407 
$m 
text map    'map407' 
b1 = 0          e1 = 0256       f1 = 020 
b2 = 0          e2 = 0256       f2 = 020 
data map    'core407' 
b1 = 0          e1 = 0300       f1 = 02000 
b2 = 0175400    e2 = 0200000    f2 = 02300 
$v 
variables 
d = 0300 
m = 0407 
s = 02400 
$q 

adb map410 core410 
$m 
text map    'map410' 
b1 = 0          e1 = 0200       f1 = 020 
b2 = 020000     e2 = 020116     f2 = 0220 
data map    'core410' 
b1 = 020000     e1 = 020200     f1 = 02000 
b2 = 0175400    e2 = 0200000    f2 = 02200 
$v 
variables 
b = 020000 
d = 0200 
m = 0410 
s = 02400 
t = 0200 
$q 

adb map411 core411 
$m 
text map    'map411' 
b1 = 0          e1 = 0200       f1 = 020 
b2 = 0          e2 = 0116       f2 = 0220 
data map    'core411' 
b1 = 0          e1 = 0200       f1 = 02000 
b2 = 0175400    e2 = 0200000    f2 = 02200 
$v 
variables 
d = 0200 
m = 0411 
s = 02400 
t = 0200 
$q 




Figure 10: Simple C program for illustrating formatting and patching 

char    str1[]  "This is a character string"; 
int     one     1; 
int     number  456; 
long    lnum    1234; 
float   fpt     1.25; 
char    str2[]  "This is the second character string"; 
main() 
{ 
        one = 2; 
} 
Figure 11: ADB output illustrating fancy formats 

adb map410 core410 
<b,-1/8ona 
020000:         0       064124  071551  064440  020163  020141  064143  071141 

_str1+016:      061541  062564  020162  072163  064562  063556  0       02 

_number: 
_number:        0710    0       02322   040240  0       064124  071551  064440 

_str2+06:       020163  064164  020145  062563  067543  062156  061440  060550 

_str2+026:      060562  072143  071145  071440  071164  067151  0147    0 

savr5+02:       0       0       0       0       0       0       0       0 

<b,20/4o4^8Cn 
020000:         0       064124  071551  064440  @'@'This i 
                020163  020141  064143  071141  s a char 
                061541  062564  020162  072163  acter st 
                064562  063556  0       02      ring@'@'@b@' 

_number:        0710    0       02322   040240  H@a@'@'R@d @@ 
                0       064124  071551  064440  @'@'This i 
                020163  064164  020145  062563  s the se 
                067543  062156  061440  060550  cond cha 
                060562  072143  071145  071440  racter s 
                071164  067151  0147    0       tring@'@'@' 
                0       0       0       0 
                0       0       0       0 
data address not found 
<b,20/4o4^8t8cna 
020000:         0       064124  071551  064440    This i 
_str1+06:       020163  020141  064143  071141  s a char 
_str1+016:      061541  062564  020162  072163  acter st 
_str1+026:      064562  063556  0       02      ring 
_number: 
_number:        0710    0       02322   040240  HR 
_fpt+02:        0       064124  071551  064440    This i 
_str2+06:       020163  064164  020145  062563  s the se 
_str2+016:      067543  062156  061440  060550  cond cha 
_str2+026:      060562  072143  071145  071440  racter s 
_str2+036:      071164  067151  0147    0       tring 
savr5+02:       0       0       0       0 
savr5+012:      0       0       0       0 
data address not found 
<b,10/2b8t^2cn 
020000:         0       0 

_str1:          0124    0150    Th 
                0151    0163    is 
                040     0151     i 
                0163    040     s 
                0141    040     a 
                0143    0150    ch 
                0141    0162    ar 
                0141    0143    ac 
                0164    0145    te 
$Q 
Figure 12: Directory and inode dumps 

adb dir - 
=nt"Inode"t"Name" 
0,-1?ut14cn 

        Inode   Name 
0:      652     . 
        82      .. 
        5971    cap.c 
        5323    cap 
        0       pp 

adb /dev/src - 
02000>b 
?m<b 
new map     '/dev/src' 
b1 = 02000      e1 = 0100000000 f1 = 0 
b2 = 0          e2 = 0          f2 = 0 
$v 
variables 
b = 02000 
<b,-1?"flags"8ton"links,uid,gid"8t3bn"size"8tbrdn"addr"8t8un"times"8t2Y2na 
02000:  flags   073145 
        links,uid,gid   0163    0164    0141 
        size    0162    10356 
        addr    28770   8236    25956   27766   25455   8236    25956   25206 
        times   1976 Feb 5 08:34:56     1975 Dec 28 10:55:15 

02040:  flags   024555 
        links,uid,gid   012     0163    0164 
        size    0162    25461 
        addr    8308    30050   8294    25130   15216   26890   29806   10784 
        times   1976 Aug 17 12:16:51    1976 Aug 17 12:16:51 

02100:  flags   05173 
        links,uid,gid   011     0162    0145 
        size    0147    29545 
        addr    25972   8306    28265   8308    25642   15216   2314    25970 
        times   1977 Apr 2 08:58:01     1977 Feb 5 10:21:44 
ADB Summary 

Command Summary 

a) formatted printing 

? format        print from a.out file according to format 
/ format        print from core file according to format 
= format        print the value of dot 
?w expr         write expression into a.out file 
/w expr         write expression into core file 
?l expr         locate expression in a.out file 

b) breakpoint and program control 

:b      set breakpoint at dot 
:c      continue running program 
:d      delete breakpoint 
:k      kill the program being debugged 
:r      run a.out file under ADB control 
:s      single step 

c) miscellaneous printing 

$b      print current breakpoints 
$c      C stack trace 
$e      external variables 
$f      floating registers 
$m      print ADB segment maps 
$q      exit from ADB 
$r      general registers 
$s      set offset for symbol match 
$v      print ADB variables 
$w      set output line width 

d) calling the shell 

!       call shell to read rest of line 

e) assignment to variables 

>name   assign dot to variable or register name 

Format Summary 

a       the value of dot 
b       one byte in octal 
c       one byte as a character 
d       one word in decimal 
f       two words in floating point 
i       PDP 11 instruction 
o       one word in octal 
n       print a newline 
r       print a blank space 
s       a null terminated character string 
nt      move to next n space tab 
u       one word as unsigned integer 
x       hexadecimal 
Y       date 
^       backup dot 
"..."   print string 

Expression Summary 

a) expression components 

decimal integer         e.g. 256 
octal integer           e.g. 0277 
hexadecimal             e.g. #ff 
symbols                 e.g. flag _main main.argc 
variables               e.g. <b 
registers               e.g. <pc <r0 
(expression)            expression grouping 

b) dyadic operators 

+       add 
-       subtract 
*       multiply 
%       integer division 
&       bitwise and 
|       bitwise or 
#       round up to the next multiple 

c) monadic operators 

~       not 
*       contents of location 
-       integer negate 

Yacc: Yet Another Compiler-Compiler 

Stephen C. Johnson 

Bell Laboratories 
Murray Hill, New Jersey 07974 

ABSTRACT 

Computer program input generally has some structure; in fact, every com¬ 
puter program that does input can be thought of as defining an “input 
language” which it accepts. An input language may be as complex as a pro¬ 
gramming language, or as simple as a sequence of numbers. Unfortunately, 
usual input facilities are limited, difficult to use, and often are lax about check¬ 
ing their inputs for validity. 

Yacc provides a general tool for describing the input to a computer pro¬ 
gram. The Yacc user specifies the structures of his input, together with code to 
be invoked as each such structure is recognized. Yacc turns such a 
specification into a subroutine that handles the input process; frequently, it is 
convenient and appropriate to have most of the flow of control in the user’s 
application handled by this subroutine. 

The input subroutine produced by Yacc calls a user-supplied routine to 
return the next basic input item. Thus, the user can specify his input in terms 
of individual input characters, or in terms of higher level constructs such as 
names and numbers. The user-supplied routine may also handle idiomatic 
features such as comment and continuation conventions, which typically defy 
easy grammatical specification. 

Yacc is written in portable C. The class of specifications accepted is a 
very general one: LALR(1) grammars with disambiguating rules. 

In addition to compilers for C, APL, Pascal, RATFOR, etc., Yacc has also 
been used for less conventional languages, including a phototypesetter 
language, several desk calculator languages, a document retrieval system, and a 
Fortran debugging system. 


0: Introduction 

Yacc provides a general tool for imposing structure on the input to a computer program. 
The Yacc user prepares a specification of the input process; this includes rules describing the 
input structure, code to be invoked when these rules are recognized, and a low-level routine to 
do the basic input. Yacc then generates a function to control the input process. This function, 
called a parser , calls the user-supplied low-level input routine (the lexical analyzer) to pick up 
the basic items (called tokens) from the input stream. These tokens are organized according to 
the input structure rules, called grammar rules ; when one of these rules has been recognized, 
then user code supplied for this rule, an action , is invoked; actions have the ability to return 
values and make use of the values of other actions. 

Yacc is written in a portable dialect of C 1 and the actions, and output subroutine, are in 
C as well. Moreover, many of the syntactic conventions of Yacc follow C. 

The heart of the input specification is a collection of grammar rules. Each rule describes 
an allowable structure and gives it a name. For example, one grammar rule might be 

date : month_name day ',' year ; 

Here, date, month_name, day, and year represent structures of interest in the input process; 
presumably, month_name, day, and year are defined elsewhere. The comma "," is enclosed in 
single quotes; this implies that the comma is to appear literally in the input. The colon and 
semicolon merely serve as punctuation in the rule, and have no significance in controlling the 
input. Thus, with proper definitions, the input 

July 4, 1776 

might be matched by the above rule. 

An important part of the input process is carried out by the lexical analyzer. This user 
routine reads the input stream, recognizing the lower level structures, and communicates these 
tokens to the parser. For historical reasons, a structure recognized by the lexical analyzer is 
called a terminal symbol, while the structure recognized by the parser is called a nonterminal 
symbol. To avoid confusion, terminal symbols will usually be referred to as tokens. 

There is considerable leeway in deciding whether to recognize structures using the lexical 
analyzer or grammar rules. For example, the rules 

month_name : 'J' 'a' 'n' ; 
month_name : 'F' 'e' 'b' ; 
 . . . 
month_name : 'D' 'e' 'c' ; 

might be used in the above example. The lexical analyzer would only need to recognize 
individual letters, and month_name would be a nonterminal symbol. Such low-level rules tend to 
waste time and space, and may complicate the specification beyond Yacc's ability to deal with 
it. Usually, the lexical analyzer would recognize the month names, and return an indication 
that a month_name was seen; in this case, month_name would be a token. 

Literal characters such as "," must also be passed through the lexical analyzer, and are 
also considered tokens. 
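
To illustrate the second approach, in which the lexical analyzer recognizes whole month names, the following C sketch shows a helper such an analyzer might call. It is an assumption-laden illustration and not part of Yacc: the name month_number and the choice of returning the month's position as a value are hypothetical.

```c
#include <string.h>
#include <stddef.h>

/*
 * Return 1..12 if word is an English month name, otherwise 0.
 * A lexical analyzer could call this and, on a nonzero result,
 * report that a month_name token was seen.
 */
int month_number(const char *word)
{
    static const char *const months[] = {
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December"
    };
    size_t i;

    for (i = 0; i < sizeof months / sizeof months[0]; i++) {
        if (strcmp(word, months[i]) == 0)
            return (int)i + 1;
    }
    return 0;
}
```

With such a helper the grammar never sees individual letters, so no letter-by-letter rules for month_name are needed.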

Specification files are very flexible. It is relatively easy to add to the above example the 
rule 

date : month '/' day '/' year ; 

allowing 

7 / 4 / 1776 

as a synonym for 

July 4, 1776 

In most cases, this new rule could be “slipped in” to a working system with minimal effort, and 
little danger of disrupting existing input. 

The input being read may not conform to the specifications. These input errors are 
detected as early as is theoretically possible with a left-to-right scan; thus, not only is the 
chance of reading and computing with bad input data substantially reduced, but the bad data 
can usually be quickly found. Error handling, provided as part of the input specifications, per¬ 
mits the reentry of bad data, or the continuation of the input process after skipping over the 
bad data. 

In some cases, Yacc fails to produce a parser when given a set of specifications. For 
example, the specifications may be self contradictory, or they may require a more powerful 
recognition mechanism than that available to Yacc. The former cases represent design errors; 
the latter cases can often be corrected by making the lexical analyzer more powerful, or by 
rewriting some of the grammar rules. While Yacc cannot handle all possible specifications, its 
power compares favorably with similar systems; moreover, the constructions which are difficult 
for Yacc to handle are also frequently difficult for human beings to handle. Some users have 
reported that the discipline of formulating valid Yacc specifications for their input revealed 
errors of conception or design early in the program development. 

The theory underlying Yacc has been described elsewhere. 2,3,4 Yacc has been extensively 
used in numerous practical applications, including lint, 5 the Portable C Compiler, 6 and a 
system for typesetting mathematics. 7 

The next several sections describe the basic process of preparing a Yacc specification; 
Section 1 describes the preparation of grammar rules, Section 2 the preparation of the user 
supplied actions associated with these rules, and Section 3 the preparation of lexical analyzers. 
Section 4 describes the operation of the parser. Section 5 discusses various reasons why Yacc 
may be unable to produce a parser from a specification, and what to do about it. Section 6 
describes a simple mechanism for handling operator precedences in arithmetic expressions. 
Section 7 discusses error detection and recovery. Section 8 discusses the operating environ¬ 
ment and special features of the parsers Yacc produces. Section 9 gives some suggestions 
which should improve the style and efficiency of the specifications. Section 10 discusses some 
advanced topics, and Section 11 gives acknowledgements. Appendix A has a brief example, 
and Appendix B gives a summary of the Yacc input syntax. Appendix C gives an example 
using some of the more advanced features of Yacc, and, finally, Appendix D describes mechan¬ 
isms and syntax no longer actively supported, but provided for historical continuity with older 
versions of Yacc. 

1: Basic Specifications 

Names refer to either tokens or nonterminal symbols. Yacc requires token names to be 
declared as such. In addition, for reasons discussed in Section 3, it is often desirable to include 
the lexical analyzer as part of the specification file; it may be useful to include other programs 
as well. Thus, every specification file consists of three sections: the declarations, (grammar) 
rules, and programs. The sections are separated by double percent “%%” marks. (The percent 
“%” is generally used in Yacc specifications as an escape character.) 

In other words, a full specification file looks like 

declarations 

%% 

rules 

%% 

programs 

The declaration section may be empty. Moreover, if the programs section is omitted, the 
second %% mark may be omitted also; thus, the smallest legal Yacc specification is 

%% 

rules 

Blanks, tabs, and newlines are ignored except that they may not appear in names or 
multi-character reserved symbols. Comments may appear wherever a name is legal; they are 
enclosed in /* . . . */, as in C and PL/I. 

The rules section is made up of one or more grammar rules. A grammar rule has the 
form: 

A : BODY ; 

A represents a nonterminal name, and BODY represents a sequence of zero or more names and 
literals. The colon and the semicolon are Yacc punctuation. 

Names may be of arbitrary length, and may be made up of letters, dot “.”, underscore “_”, 
and non-initial digits. Upper and lower case letters are distinct. The names used in the 
body of a grammar rule may represent tokens or nonterminal symbols. 

A literal consists of a character enclosed in single quotes “'”. As in C, the backslash “\” 
is an escape character within literals, and all the C escapes are recognized. Thus 

'\n'    newline 
'\r'    return 
'\''    single quote “'” 
'\\'    backslash “\” 
'\t'    tab 
'\b'    backspace 
'\f'    form feed 
'\xxx'  “xxx” in octal 

For a number of technical reasons, the NUL character ('\0' or 0) should never be used in 
grammar rules. 

If there are several grammar rules with the same left hand side, the vertical bar “|” can be 
used to avoid rewriting the left hand side. In addition, the semicolon at the end of a rule can 
be dropped before a vertical bar. Thus the grammar rules 

A : B C D ; 
A : E F ; 
A : G ; 

can be given to Yacc as 

A : B C D 
  | E F 
  | G 
  ; 

It is not necessary that all grammar rules with the same left side appear together in the gram¬ 
mar rules section, although it makes the input much more readable, and easier to change. 
If a nonterminal symbol matches the empty string, this can be indicated in the obvious 
way: 

empty : ; 

Names representing tokens must be declared; this is most simply done by writing 

%token name1 name2 . . . 

in the declarations section. (See Sections 3, 5, and 6 for much more discussion.) Every name 
not defined in the declarations section is assumed to represent a nonterminal symbol. Every 
nonterminal symbol must appear on the left side of at least one rule. 

Of all the nonterminal symbols, one, called the start symbol, has particular importance. 
The parser is designed to recognize the start symbol; thus, this symbol represents the largest, 
most general structure described by the grammar rules. By default, the start symbol is taken 
to be the left hand side of the first grammar rule in the rules section. It is possible, and in fact 
desirable, to declare the start symbol explicitly in the declarations section using the %start 
keyword: 

%start symbol 

The end of the input to the parser is signaled by a special token, called the endmarker. If 
the tokens up to, but not including, the endmarker form a structure which matches the start 
symbol, the parser function returns to its caller after the endmarker is seen; it accepts the 
input. If the endmarker is seen in any other context, it is an error. 

It is the job of the user-supplied lexical analyzer to return the endmarker when appropri¬ 
ate; see section 3, below. Usually the endmarker represents some reasonably obvious I/O 
status, such as “end-of-file” or “end-of-record”. 

2: Actions 

With each grammar rule, the user may associate actions to be performed each time the 
rule is recognized in the input process. These actions may return values, and may obtain the 
values returned by previous actions. Moreover, the lexical analyzer can return values for 
tokens, if desired. 

An action is an arbitrary C statement, and as such can do input and output, call 
subprograms, and alter external vectors and variables. An action is specified by one or more 
statements, enclosed in curly braces “{” and “}”. For example, 

A : '(' B ')' 
        { hello( 1, "abc" ); } 

and 

XXX : YYY ZZZ 
        { printf("a message\n"); 
          flag = 25; } 

are grammar rules with actions. 

To facilitate easy communication between the actions and the parser, the action state¬ 
ments are altered slightly. The symbol “dollar sign” “$” is used as a signal to Yacc in this con¬ 
text. 

To return a value, the action normally sets the pseudo-variable “$$” to some value. For 
example, an action that does nothing but return the value 1 is 

{ $$ = 1; } 

To obtain the values returned by previous actions and the lexical analyzer, the action 
may use the pseudo-variables $1, $2, . . ., which refer to the values returned by the components 
of the right side of a rule, reading from left to right. Thus, if the rule is 

A : B C D ; 

for example, then $2 has the value returned by C, and $3 the value returned by D. 

As a more concrete example, consider the rule 

expr : '(' expr ')' ; 

The value returned by this rule is usually the value of the expr in parentheses. This can be 
indicated by 

expr : '(' expr ')' { $$ = $2 ; } 

By default, the value of a rule is the value of the first element in it ($1). Thus, grammar 
rules of the form 

A : B ; 

frequently need not have an explicit action. 

In the examples above, all the actions came at the end of their rules. Sometimes, it is 
desirable to get control before a rule is fully parsed. Yacc permits an action to be written in 
the middle of a rule as well as at the end. This rule is assumed to return a value, accessible 
through the usual mechanism by the actions to the right of it. In turn, it may access the 
values returned by the symbols to its left. Thus, in the rule 

A : B 
        { $$ = 1; } 
    C 
        { x = $2; y = $3; } 
    ; 

the effect is to set x to 1, and y to the value returned by C. 

Actions that do not terminate a rule are actually handled by Yacc by manufacturing a 
new nonterminal symbol name, and a new rule matching this name to the empty string. The 
interior action is the action triggered off by recognizing this added rule. Yacc actually treats 
the above example as if it had been written: 

$ACT : /* empty */ 
        { $$ = 1; } 
    ; 

A : B $ACT C 
        { x = $2; y = $3; } 
    ; 

In many applications, output is not done directly by the actions; rather, a data structure, 
such as a parse tree, is constructed in memory, and transformations are applied to it before 
output is generated. Parse trees are particularly easy to construct, given routines to build and 
maintain the tree structure desired. For example, suppose there is a C function node , written 
so that the call 

node( L, n1, n2 )

creates a node with label L, and descendants n1 and n2, and returns the index of the newly
created node. Then a parse tree can be built by supplying actions such as:

expr : expr '+' expr 

        { $$ = node( '+', $1, $3 ); }

in the specification. 
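
The text leaves the node routine to the user. As a minimal sketch only, assuming a fixed-size table of nodes and the convention (not taken from the text) that -1 marks a missing descendant, such a routine might look like the following. The names tnode, nodes, nnodes, and MAXNODES are invented for this illustration.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAXNODES 1000   /* arbitrary table size for this sketch */

/* One tree node: a label and the indices of its two descendants. */
struct tnode {
    int label;
    int left;
    int right;
};

static struct tnode nodes[MAXNODES];
static int nnodes = 0;          /* number of nodes allocated so far */

/* Create a node and return its index, as the text describes. */
int node(int label, int n1, int n2)
{
    if (nnodes >= MAXNODES) {
        fprintf(stderr, "node: out of tree space\n");
        exit(1);
    }
    nodes[nnodes].label = label;
    nodes[nnodes].left = n1;
    nodes[nnodes].right = n2;
    return nnodes++;
}
```

Because node returns an integer index rather than a pointer, the values passed through $$, $1, and $3 remain integers, which matches the assumption made for the examples in this section.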

The user may define other variables to be used by the actions. Declarations and
definitions can appear in the declarations section, enclosed in the marks “%{” and “%}”. These
declarations and definitions have global scope, so they are known to the action statements and
the lexical analyzer. For example,

%{ int variable = 0; %}

could be placed in the declarations section, making variable accessible to all of the actions. 
The Yacc parser uses only names beginning in “yy”; the user should avoid such names. 

In these examples, all the values are integers: a discussion of values of other types will be 
found in Section 10. 

3: Lexical Analysis 

The user must supply a lexical analyzer to read the input stream and communicate 
tokens (with values, if desired) to the parser. The lexical analyzer is an integer-valued function 
called yylex. The function returns an integer, the token number , representing the kind of token 
read. If there is a value associated with that token, it should be assigned to the external vari¬ 
able yylval. 

The parser and the lexical analyzer must agree on these token numbers in order for com¬ 
munication between them to take place. The numbers may be chosen by Yacc, or chosen by 
the user. In either case, the “# define” mechanism of C is used to allow the lexical analyzer to 
return these numbers symbolically. For example, suppose that the token name DIGIT has 
been defined in the declarations section of the Yacc specification file. The relevant portion of 
the lexical analyzer might look like: 

yylex(){
        extern int yylval;
        int c;
        . . .
        c = getchar();
        . . .
        switch( c ) {
                . . .
        case '0':
        case '1':
          . . .
        case '9':
                yylval = c-'0';
                return( DIGIT );
                . . .
                }
        . . .

The intent is to return a token number of DIGIT, and a value equal to the numerical 
value of the digit. Provided that the lexical analyzer code is placed in the programs section of 
the specification file, the identifier DIGIT will be defined as the token number associated with 
the token DIGIT. 

This mechanism leads to clear, easily modified lexical analyzers; the only pitfall is the 
need to avoid using any token names in the grammar that are reserved or significant in C or 
the parser; for example, the use of token names if or while will almost certainly cause severe 
difficulties when the lexical analyzer is compiled. The token name error is reserved for error 
handling, and should not be used naively (see Section 7). 

As mentioned above, the token numbers may be chosen by Yacc or by the user. In the 
default situation, the numbers are chosen by Yacc. The default token number for a literal 
character is the numerical value of the character in the local character set. Other names are 
assigned token numbers starting at 257.

To assign a token number to a token (including literals), the first appearance of the token 
name or literal in the declarations section can be immediately followed by a nonnegative 
integer. This integer is taken to be the token number of the name or literal. Names and 
literals not defined by this mechanism retain their default definition. It is important that all 
token numbers be distinct. 

For historical reasons, the endmarker must have token number 0 or negative. This token 
number cannot be redefined by the user; thus, all lexical analyzers should be prepared to return 
0 or negative as a token number upon reaching the end of their input. 
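
The conventions of this section can be gathered into one small, complete lexical analyzer. The following is a sketch under stated assumptions, not the code of any particular system: the token number 257 for DIGIT is assumed (Yacc would normally supply the definition), yylval is defined here only to make the fragment self-contained, and the input is taken from a string through the invented names input_ptr and set_input rather than from getchar.

```c
#include <ctype.h>

#define DIGIT 257       /* assumed; Yacc normally generates this define */

int yylval;             /* normally defined in the parser Yacc generates */

static const char *input_ptr = "";   /* hypothetical input source */

/* Point the lexical analyzer at a new string of input. */
void set_input(const char *s)
{
    input_ptr = s;
}

/* Return the next token number; 0 signals the endmarker. */
int yylex(void)
{
    int c;

    /* skip blanks and tabs between tokens */
    while (*input_ptr == ' ' || *input_ptr == '\t')
        input_ptr++;

    if (*input_ptr == '\0')
        return 0;                   /* endmarker */

    c = (unsigned char)*input_ptr++;
    if (isdigit(c)) {
        yylval = c - '0';           /* value of the digit */
        return DIGIT;
    }
    return c;       /* a literal character is its own token number */
}
```

Note that the helper names do not begin with “yy”, in keeping with the advice given earlier, and that the routine keeps returning 0 once the end of the input has been reached.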

A very useful tool for constructing lexical analyzers is the Lex program developed by 
Mike Lesk [8]. These lexical analyzers are designed to work in close harmony with Yacc parsers.
The specifications for these lexical analyzers use regular expressions instead of grammar rules. 
Lex can be easily used to produce quite complicated lexical analyzers, but there remain some 
languages (such as FORTRAN) which do not fit any theoretical framework, and whose lexical 
analyzers must be crafted by hand. 

4: How the Parser Works 

Yacc turns the specification file into a C program, which parses the input according to 
the specification given. The algorithm used to go from the specification to the parser is com¬ 
plex, and will not be discussed here (see the references for more information). The parser 
itself, however, is relatively simple, and understanding how it works, while not strictly neces¬ 
sary, will nevertheless make treatment of error recovery and ambiguities much more 
comprehensible. 

The parser produced by Yacc consists of a finite state machine with a stack. The parser 
is also capable of reading and remembering the next input token (called the lookahead token). 
The current state is always the one on the top of the stack. The states of the finite state 
machine are given small integer labels; initially, the machine is in state 0, the stack contains 
only state 0, and no lookahead token has been read. 

The machine has only four actions available to it, called shift, reduce, accept, and error.
A move of the parser is done as follows: 

1. Based on its current state, the parser decides whether it needs a lookahead token to 
decide what action should be done; if it needs one, and does not have one, it calls yylex to 
obtain the next token. 

2. Using the current state, and the lookahead token if needed, the parser decides on its next 
action, and carries it out. This may result in states being pushed onto the stack, or 
popped off of the stack, and in the lookahead token being processed or left alone. 

The shift action is the most common action the parser takes. Whenever a shift action is 
taken, there is always a lookahead token. For example, in state 56 there may be an action: 

IF shift 34 

which says, in state 56, if the lookahead token is IF, the current state (56) is pushed down on 
the stack, and state 34 becomes the current state (on the top of the stack). The lookahead 
token is cleared. 

The reduce action keeps the stack from growing without bounds. Reduce actions are 
appropriate when the parser has seen the right hand side of a grammar rule, and is prepared to 
announce that it has seen an instance of the rule, replacing the right hand side by the left hand 
side. It may be necessary to consult the lookahead token to decide whether to reduce, but usu¬ 
ally it is not; in fact, the default action (represented by a “.”) is often a reduce action. 

Reduce actions are associated with individual grammar rules. Grammar rules are also 
given small integer numbers, leading to some confusion. The action 

reduce 18


refers to grammar rule 18, while the action 
IF shift 34 


refers to state 34. 

Suppose the rule being reduced is 
A : x y z ; 

The reduce action depends on the left hand symbol (A in this case), and the number of symbols
on the right hand side (three in this case). To reduce, first pop off the top three states
from the stack (in general, the number of states popped equals the number of symbols on the
right side of the rule). In effect, these states were the ones put on the stack while recognizing
x, y, and z, and no longer serve any useful purpose. After popping these states, a state is 
uncovered which was the state the parser was in before beginning to process the rule. Using 
this uncovered state, and the symbol on the left side of the rule, perform what is in effect a 
shift of A. A new state is obtained, pushed onto the stack, and parsing continues. There are 
significant differences between the processing of the left hand symbol and an ordinary shift of 
a token, however, so this action is called a goto action. In particular, the lookahead token is 
cleared by a shift, and is not affected by a goto. In any case, the uncovered state contains an 
entry such as: 


A goto 20 

causing state 20 to be pushed onto the stack, and become the current state. 

In effect, the reduce action “turns back the clock” in the parse, popping the states off the 
stack to go back to the state where the right hand side of the rule was first seen. The parser 
then behaves as if it had seen the left side at that time. If the right hand side of the rule is 
empty, no states are popped off of the stack: the uncovered state is in fact the current state. 

The reduce action is also important in the treatment of user-supplied actions and values. 
When a rule is reduced, the code supplied with the rule is executed before the stack is adjusted. 
In addition to the stack holding the states, another stack, running in parallel with it, holds the 
values returned from the lexical analyzer and the actions. When a shift takes place, the exter¬ 
nal variable yylval is copied onto the value stack. After the return from the user code, the 
reduction is carried out. When the goto action is done, the external variable yyval is copied 
onto the value stack. The pseudo-variables $1, $2, etc., refer to the value stack. 

The other two parser actions are conceptually much simpler. The accept action indicates 
that the entire input has been seen and that it matches the specification. This action appears 
only when the lookahead token is the endmarker, and indicates that the parser has successfully 
done its job. The error action, on the other hand, represents a place where the parser can no 
longer continue parsing according to the specification. The input tokens it has seen, together 
with the lookahead token, cannot be followed by anything that would result in a legal input. 
The parser reports an error, and attempts to recover the situation and resume parsing: the 
error recovery (as opposed to the detection of error) will be covered in Section 7. 

It is time for an example! Consider the specification 

%token DING DONG DELL

%%

rhyme : sound place
      ;

sound : DING DONG
      ;

place : DELL
      ;

When Yacc is invoked with the -v option, a file called y.output is produced, with a
human-readable description of the parser. The y.output file corresponding to the above grammar
(with some statistics stripped off the end) is:

state 0 

$accept : _rhyme $end 

DING shift 3 
. error 

rhyme goto 1 
sound goto 2 

state 1 

$accept : rhyme_$end 

$end accept 
. error 

state 2 

rhyme : sound_place 

DELL shift 5 
. error 

place goto 4 

state 3 

sound : DING_DONG

DONG shift 6 
. error 

state 4 

rhyme : sound place_ (1) 

. reduce 1

state 5 

place : DELL_ (3) 

. reduce 3

state 6

sound : DING DONG_ (2)

. reduce 2

Notice that, in addition to the actions for each state, there is a description of the parsing rules
being processed in each state. The _ character is used to indicate what has been seen, and
what is yet to come, in each rule. Suppose the input is

DING DONG DELL 

It is instructive to follow the steps of the parser while processing this input. 

Initially, the current state is state 0. The parser needs to refer to the input in order to
decide between the actions available in state 0, so the first token, DING, is read, becoming the
lookahead token. The action in state 0 on DING is “shift 3”, so state 3 is pushed onto the
stack, and the lookahead token is cleared. State 3 becomes the current state. The next token,
DONG, is read, becoming the lookahead token. The action in state 3 on the token DONG is
“shift 6”, so state 6 is pushed onto the stack, and the lookahead is cleared. The stack now
contains 0, 3, and 6. In state 6, without even consulting the lookahead, the parser reduces by
rule 2.

sound : DING DONG 

This rule has two symbols on the right hand side, so two states, 6 and 3, are popped off of the 
stack, uncovering state 0. Consulting the description of state 0, looking for a goto on sound,

sound goto 2 

is obtained; thus state 2 is pushed onto the stack, becoming the current state. 

In state 2, the next token, DELL, must be read. The action is “shift 5”, so state 5 is
pushed onto the stack, which now has 0, 2, and 5 on it, and the lookahead token is cleared. In
state 5, the only action is to reduce by rule 3. This has one symbol on the right hand side, so
one state, 5, is popped off, and state 2 is uncovered. The goto in state 2 on place, the left side
of rule 3, is state 4. Now, the stack contains 0, 2, and 4. In state 4, the only action is to
reduce by rule 1. There are two symbols on the right, so the top two states are popped off,
uncovering state 0 again. In state 0, there is a goto on rhyme causing the parser to enter state
1. In state 1, the input is read; the endmarker is obtained, indicated by “$end” in the y.output
file. The action in state 1 when the endmarker is seen is to accept, successfully ending the
parse.
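
The moves traced above can be imitated by a small hand-written C routine. The sketch below is not the code that Yacc generates; it simply hard-codes the seven states of the y.output listing so that the shift, reduce, goto, and accept steps can be followed one at a time. The names parse, goto_state, rule_len, and rule_lhs are invented for this illustration, and the token numbers are assumed to be the defaults starting at 257.

```c
enum { END = 0, DING = 257, DONG, DELL };    /* token numbers */
enum { RHYME, SOUND, PLACE };                /* nonterminal symbols */

/* Right-hand-side length and left-hand symbol of rules 1 to 3. */
static const int rule_len[] = { 0, 2, 2, 1 };
static const int rule_lhs[] = { 0, RHYME, SOUND, PLACE };

/* Goto entries: the state entered after reducing to lhs in state s. */
static int goto_state(int s, int lhs)
{
    if (s == 0 && lhs == RHYME) return 1;
    if (s == 0 && lhs == SOUND) return 2;
    if (s == 2 && lhs == PLACE) return 4;
    return -1;
}

/* Parse a 0-terminated token array; return 0 on accept, 1 on error. */
int parse(const int *tokens)
{
    int stack[16];
    int sp = 0;
    int look = -1;          /* -1 means no lookahead token is held */
    int rule, next;

    stack[0] = 0;
    for (;;) {
        int state = stack[sp];

        rule = 0;
        next = -1;
        if (state == 4) rule = 1;           /* states with only ". reduce" */
        else if (state == 5) rule = 3;
        else if (state == 6) rule = 2;

        if (rule != 0) {
            /* reduce: pop one state per right-hand symbol, then goto */
            sp -= rule_len[rule];
            next = goto_state(stack[sp], rule_lhs[rule]);
            if (next < 0) return 1;
            stack[++sp] = next;
            continue;
        }

        /* every other state needs the lookahead token */
        if (look < 0)
            look = *tokens++;

        if (state == 1)
            return look == END ? 0 : 1;     /* accept on the endmarker */

        if (state == 0 && look == DING) next = 3;
        else if (state == 2 && look == DELL) next = 5;
        else if (state == 3 && look == DONG) next = 6;
        if (next < 0) return 1;             /* the "." error entry */

        stack[++sp] = next;                 /* shift */
        look = -1;                          /* and clear the lookahead */
    }
}
```

Calling parse on the tokens DING DONG DELL followed by the endmarker passes through states 0, 3, 6, 2, 5, 4, and 1 in the order described in the text. This sketch stops at the first error instead of attempting any recovery.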

The reader is urged to consider how the parser works when confronted with such
incorrect strings as DING DONG DONG, DING DONG, DING DONG DELL DELL, etc. A
few minutes spent with this and other simple examples will probably be repaid when problems
arise in more complicated contexts.

5: Ambiguity and Conflicts 

A set of grammar rules is ambiguous if there is some input string that can be structured 
in two or more different ways. For example, the grammar rule 

expr : expr '-' expr

is a natural way of expressing the fact that one way of forming an arithmetic expression is to 
put two other expressions together with a minus sign between them. Unfortunately, this gram¬ 
mar rule does not completely specify the way that all complex inputs should be structured. For 
example, if the input is 

expr - expr - expr 

the rule allows this input to be structured as either 
( expr - expr ) - expr 

or as 

expr - ( expr - expr ) 

(The first is called left association, the second right association.)
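
The choice between the two structures is not merely cosmetic, since subtraction gives different answers under each. As a small illustration (the function names are invented here), the two groupings can be written out directly in C:

```c
/* (a - b) - c: the left associative grouping */
int left_assoc(int a, int b, int c)
{
    return (a - b) - c;
}

/* a - (b - c): the right associative grouping */
int right_assoc(int a, int b, int c)
{
    return a - (b - c);
}
```

With a, b, and c equal to 10, 4, and 3, the left associative grouping yields 3, while the right associative grouping yields 9.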

Yacc detects such ambiguities when it is attempting to build the parser. It is instructive 
to consider the problem that confronts the parser when it is given an input such as 

expr - expr - expr 

When the parser has read the second expr, the input that it has seen: 

expr - expr

matches the right side of the grammar rule above. The parser could reduce the input by applying
this rule; after applying the rule, the input is reduced to expr (the left side of the rule). The
parser would then read the final part of the input:

- expr 

and again reduce. The effect of this is to take the left associative interpretation. 

Alternatively, when the parser has seen 
expr - expr 

it could defer the immediate application of the rule, and continue reading the input until it had 
seen 


expr - expr - expr 

It could then apply the rule to the rightmost three symbols, reducing them to expr and leaving 
expr - expr 

Now the rule can be reduced once more; the effect is to take the right associative interpreta¬ 
tion. Thus, having read 

expr - expr 

the parser can do two legal things, a shift or a reduction, and has no way of deciding between
them. This is called a shift/reduce conflict. It may also happen that the parser has a choice
of two legal reductions; this is called a reduce/reduce conflict. Note that there are never any
“shift/shift” conflicts.

When there are shift/reduce or reduce/reduce conflicts, Yacc still produces a parser. It 
does this by selecting one of the valid steps wherever it has a choice. A rule describing which 
choice to make in a given situation is called a disambiguating rule. 

Yacc invokes two disambiguating rules by default: 

1. In a shift/reduce conflict, the default is to do the shift. 

2. In a reduce/reduce conflict, the default is to reduce by the earlier grammar rule (in the 
input sequence). 

Rule 1 implies that reductions are deferred whenever there is a choice, in favor of shifts. 
Rule 2 gives the user rather crude control over the behavior of the parser in this situation, but 
reduce/reduce conflicts should be avoided whenever possible. 

Conflicts may arise because of mistakes in input or logic, or because the grammar rules, 
while consistent, require a more complex parser than Yacc can construct. The use of actions 
within rules can also cause conflicts, if the action must be done before the parser can be sure 
which rule is being recognized. In these cases, the application of disambiguating rules is inap¬ 
propriate, and leads to an incorrect parser. For this reason, Yacc always reports the number of 
shift/reduce and reduce/reduce conflicts resolved by Rule 1 and Rule 2. 

In general, whenever it is possible to apply disambiguating rules to produce a correct 
parser, it is also possible to rewrite the grammar rules so that the same inputs are read but 
there are no conflicts. For this reason, most previous parser generators have considered 
conflicts to be fatal errors. Our experience has suggested that this rewriting is somewhat unna¬ 
tural, and produces slower parsers; thus, Yacc will produce parsers even in the presence of 
conflicts. 

As an example of the power of disambiguating rules, consider a fragment from a pro¬ 
gramming language involving an “if-then-else” construction: 




stat : IF '(' cond ')' stat
     | IF '(' cond ')' stat ELSE stat
     ;




In these rules, IF and ELSE are tokens, cond is a nonterminal symbol describing conditional 
(logical) expressions, and stat is a nonterminal symbol describing statements. The first rule 
will be called the simple-if rule, and the second the if-else rule. 

These two rules form an ambiguous construction, since input of the form 

IF ( C1 ) IF ( C2 ) S1 ELSE S2

can be structured according to these rules in two ways: 

IF ( C1 ) {
        IF ( C2 ) S1
        }
ELSE S2

or

IF ( C1 ) {
        IF ( C2 ) S1
        ELSE S2
        }

The second interpretation is the one given in most programming languages having this con¬ 
struct. Each ELSE is associated with the last preceding “un-ELSE'd” IF. In this example, 
consider the situation where the parser has seen 

IF ( C1 ) IF ( C2 ) S1

and is looking at the ELSE. It can immediately reduce by the simple-if rule to get

IF ( C1 ) stat

and then read the remaining input,

ELSE S2

and reduce

IF ( C1 ) stat ELSE S2

by the if-else rule. This leads to the first of the above groupings of the input. 

On the other hand, the ELSE may be shifted, S2 read, and then the right hand portion of

IF ( C1 ) IF ( C2 ) S1 ELSE S2

can be reduced by the if-else rule to get

IF ( C1 ) stat

which can be reduced by the simple-if rule. This leads to the second of the above groupings of 
the input, which is usually desired. 

Once again the parser can do two valid things - there is a shift/reduce conflict. The 
application of disambiguating rule 1 tells the parser to shift in this case, which leads to the 
desired grouping. 

This shift/reduce conflict arises only when there is a particular current input symbol,
ELSE, and particular inputs already seen, such as

IF ( C1 ) IF ( C2 ) S1

In general, there may be many conflicts, and each one will be associated with an input symbol
and a set of previously read inputs. The previously read inputs are characterized by the state
of the parser.

The conflict messages of Yacc are best understood by examining the verbose (-v) option 
output file. For example, the output corresponding to the above conflict state might be: 

23: shift/reduce conflict (shift 45, reduce 18) on ELSE 
state 23 

stat : IF ( cond ) stat_ (18) 

stat : IF ( cond ) stat_ELSE stat 

ELSE shift 45 
. reduce 18

The first line describes the conflict, giving the state and the input symbol. The ordinary state 
description follows, giving the grammar rules active in the state, and the parser actions. Recall 
that the underline marks the portion of the grammar rules which has been seen. Thus in the 
example, in state 23 the parser has seen input corresponding to 

IF ( cond ) stat 

and the two grammar rules shown are active at this time. The parser can do two possible 
things. If the input symbol is ELSE, it is possible to shift into state 45. State 45 will have, as
part of its description, the line 

stat : IF ( cond ) stat ELSE_stat 

since the ELSE will have been shifted in this state. Back in state 23, the alternative action,
described by “.”, is to be done if the input symbol is not mentioned explicitly in the above
actions; thus, in this case, if the input symbol is not ELSE, the parser reduces by grammar
rule 18:

stat : IF '(' cond ')' stat

Once again, notice that the numbers following “shift” commands refer to other states, while
the numbers following “reduce” commands refer to grammar rule numbers. In the y.output
file, the rule numbers are printed after those rules which can be reduced. In most states,
there will be at most one reduce action possible in the state, and this will be the default command.
The user who encounters unexpected shift/reduce conflicts will probably want to look at the
verbose output to decide whether the default actions are appropriate. In really tough cases, the
user might need to know more about the behavior and construction of the parser than can be
covered here. In this case, one of the theoretical references [2, 3, 4] might be consulted; the
services of a local guru might also be appropriate.

6: Precedence 

There is one common situation where the rules given above for resolving conflicts are not 
sufficient; this is in the parsing of arithmetic expressions. Most of the commonly used con¬ 
structions for arithmetic expressions can be naturally described by the notion of precedence 
levels for operators, together with information about left or right associativity. It turns out 
that ambiguous grammars with appropriate disambiguating rules can be used to create parsers 
that are faster and easier to write than parsers constructed from unambiguous grammars. The 
basic notion is to write grammar rules of the form 

expr : expr OP expr 

and 

expr : UNARY expr 

for all binary and unary operators desired. This creates a very ambiguous grammar, with many
parsing conflicts. As disambiguating rules, the user specifies the precedence, or binding 
strength, of all the operators, and the associativity of the binary operators. This information is 
sufficient to allow Yacc to resolve the parsing conflicts in accordance with these rules, and con¬ 
struct a parser that realizes the desired precedences and associativities. 

The precedences and associativities are attached to tokens in the declarations section. 
This is done by a series of lines beginning with a Yacc keyword: %left, %right, or %nonassoc, 
followed by a list of tokens. All of the tokens on the same line are assumed to have the same 
precedence level and associativity; the lines are listed in order of increasing precedence or bind¬ 
ing strength. Thus, 

%left '+' '-'
%left '*' '/'

describes the precedence and associativity of the four arithmetic operators. Plus and minus
are left associative, and have lower precedence than star and slash, which are also left associative.
The keyword %right is used to describe right associative operators, and the keyword
%nonassoc is used to describe operators, like the operator .LT. in Fortran, that may not associate
with themselves; thus,

A .LT. B .LT. C 

is illegal in Fortran, and such an operator would be described with the keyword %nonassoc in 
Yacc. As an example of the behavior of these declarations, the description 


%right '='
%left '+' '-'
%left '*' '/'

%%

expr : expr '=' expr
     | expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | NAME
     ;


might be used to structure the input 

a = b = c*d - e - f*g

as follows: 

a = ( b = ( ((c*d)-e) - (f*g) ) ) 

When this mechanism is used, unary operators must, in general, be given a precedence. Sometimes
a unary operator and a binary operator have the same symbolic representation, but
different precedences. An example is unary and binary '-'; unary minus may be given the same
strength as multiplication, or even higher, while binary minus has a lower strength than
multiplication. The keyword, %prec, changes the precedence level associated with a particular
grammar rule. %prec appears immediately after the body of the grammar rule, before the
action or closing semicolon, and is followed by a token name or literal. It causes the
precedence of the grammar rule to become that of the following token name or literal. For example,
to make unary minus have the same precedence as multiplication the rules might resemble:

%left '+' '-'
%left '*' '/'

%%

expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | '-' expr %prec '*'
     | NAME
     ;


A token declared by %left, %right, and %nonassoc need not be, but may be, declared by 
%token as well. 

The precedences and associativities are used by Yacc to resolve parsing conflicts; they 
give rise to disambiguating rules. Formally, the rules work as follows: 

1. The precedences and associativities are recorded for those tokens and literals that have 
them. 

2. A precedence and associativity is associated with each grammar rule; it is the precedence 
and associativity of the last token or literal in the body of the rule. If the %prec con¬ 
struction is used, it overrides this default. Some grammar rules may have no precedence 
and associativity associated with them. 

3. When there is a reduce/reduce conflict, or there is a shift/reduce conflict and either the 
input symbol or the grammar rule has no precedence and associativity, then the two 
disambiguating rules given at the beginning of the section are used, and the conflicts are 
reported. 

4. If there is a shift/reduce conflict, and both the grammar rule and the input character 
have precedence and associativity associated with them, then the conflict is resolved in 
favor of the action (shift or reduce) associated with the higher precedence. If the pre¬ 
cedences are the same, then the associativity is used; left associative implies reduce, right 
associative implies shift, and nonassociating implies error. 
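
Steps 3 and 4 amount to a small decision procedure. The following C sketch is one way of writing it down; it is an illustration of the rules as stated, not Yacc's own code. The type and function names are invented, and it is assumed that when the precedences are equal the rule and the token share the same associativity (they were declared on the same line), so the token's associativity is consulted.

```c
enum assoc { NONE, LEFT, RIGHT, NONASSOC };   /* NONE: nothing declared */
enum action { SHIFT, REDUCE, PARSE_ERROR, DEFAULT_RULES };

struct prec {
    int level;          /* higher number binds tighter; ignored if NONE */
    enum assoc assoc;
};

/* Resolve a shift/reduce conflict between a rule and an input token. */
enum action resolve(struct prec rule, struct prec token)
{
    /* step 3: fall back on the two default disambiguating rules */
    if (rule.assoc == NONE || token.assoc == NONE)
        return DEFAULT_RULES;

    /* step 4: the action with the higher precedence wins */
    if (rule.level > token.level)
        return REDUCE;
    if (token.level > rule.level)
        return SHIFT;

    /* equal precedence: associativity decides */
    switch (token.assoc) {
    case LEFT:  return REDUCE;
    case RIGHT: return SHIFT;
    default:    return PARSE_ERROR;    /* nonassociating */
    }
}
```

For example, with '+' at level 1 and '*' at level 2, both left associative, a conflict between the rule for '+' and the input token '*' resolves to a shift, while a conflict between the rule for '+' and another '+' resolves to a reduce.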

Conflicts resolved by precedence are not counted in the number of shift/reduce and 
reduce/reduce conflicts reported by Yacc. This means that mistakes in the specification of pre¬ 
cedences may disguise errors in the input grammar; it is a good idea to be sparing with pre¬ 
cedences, and use them in an essentially “cookbook” fashion, until some experience has been 
gained. The y.output file is very useful in deciding whether the parser is actually doing what 
was intended. 

7: Error Handling 

Error handling is an extremely difficult area, and many of the problems are semantic 
ones. When an error is found, for example, it may be necessary to reclaim parse tree storage, 
delete or alter symbol table entries, and, typically, set switches to avoid generating any further 
output. 

It is seldom acceptable to stop all processing when an error is found; it is more useful to 
continue scanning the input to find further syntax errors. This leads to the problem of getting 
the parser “restarted” after an error. A general class of algorithms to do this involves discard¬ 
ing a number of tokens from the input string, and attempting to adjust the parser so that input 
can continue. 

To allow the user some control over this process, Yacc provides a simple, but reasonably 
general, feature. The token name “error” is reserved for error handling. This name can be 
used in grammar rules; in effect, it suggests places where errors are expected, and recovery 
might take place. The parser pops its stack until it enters a state where the token “error” is
legal. It then behaves as if the token “error” were the current lookahead token, and performs 
the action encountered. The lookahead token is then reset to the token that caused the error. 
If no special error rules have been specified, the processing halts when an error is detected. 

In order to prevent a cascade of error messages, the parser, after detecting an error, 
remains in error state until three tokens have been successfully read and shifted. If an error is 
detected when the parser is already in error state, no message is given, and the input token is 
quietly deleted. 
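
The three-token rule can be pictured as a simple counter. The sketch below is a simplification for illustration, not the parser's actual code; the names errflag, error_detected, and token_shifted are invented, and the special handling of the “error” token itself is left out.

```c
/* Tokens still to be shifted before the parser leaves error state. */
static int errflag = 0;

/* Called when a syntax error is detected.  Returns 1 if a message
   should be printed, 0 if the parser was already in error state. */
int error_detected(void)
{
    int report = (errflag == 0);

    errflag = 3;
    return report;
}

/* Called each time an input token is successfully shifted. */
void token_shifted(void)
{
    if (errflag > 0)
        errflag--;
}
```

A second error arriving before three tokens have been shifted finds the counter nonzero, so no message is produced; once three shifts have brought the counter back to zero, the next error is reported again.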

As an example, a rule of the form 
stat : error 

would, in effect, mean that on a syntax error the parser would attempt to skip over the state¬ 
ment in which the error was seen. More precisely, the parser will scan ahead, looking for three 
tokens that might legally follow a statement, and start processing at the first of these; if the 
beginnings of statements are not sufficiently distinctive, it may make a false start in the middle 
of a statement, and end up reporting a second error where there is in fact no error. 

Actions may be used with these special error rules. These actions might attempt to reini¬ 
tialize tables, reclaim symbol table space, etc. 

Error rules such as the above are very general, but difficult to control. Somewhat easier 
are rules such as 

stat : error ';'

Here, when there is an error, the parser attempts to skip over the statement, but will do so by
skipping to the next ';'. All tokens after the error and before the next ';' cannot be shifted, and
are discarded. When the ';' is seen, this rule will be reduced, and any “cleanup” action associated
with it performed.

Another form of error rule arises in interactive applications, where it may be desirable to 
permit a line to be reentered after an error. A possible error rule might be 

input : error '\n' { printf( "Reenter last line: " ); } input
        { $$ = $4; }

There is one potential difficulty with this approach; the parser must correctly process three 
input tokens before it admits that it has correctly resynchronized after the error. If the reen¬ 
tered line contains an error in the first two tokens, the parser deletes the offending tokens, and 
gives no message; this is clearly unacceptable. For this reason, there is a mechanism that can 
be used to force the parser to believe that an error has been fully recovered from. The state¬ 
ment 


yyerrok ; 

in an action resets the parser to its normal mode. The last example is better written 


input : error '\n'
        { yyerrok;
          printf( "Reenter last line: " ); }
    input
        { $$ = $4; }
    ;


As mentioned above, the token seen immediately after the “error” symbol is the input 
token at which the error was discovered. Sometimes, this is inappropriate; for example, an 
error recovery action might take upon itself the job of finding the correct place to resume 
input. In this case, the previous lookahead token must be cleared. The statement 

yyclearin ;

in an action will have this effect. For example, suppose the action after error were to call some 
sophisticated resynchronization routine, supplied by the user, that attempted to advance the 
input to the beginning of the next valid statement. After this routine was called, the next 
token returned by yylex would presumably be the first token in a legal statement; the old, ille¬ 
gal token must be discarded, and the error state reset. This could be done by a rule like 

        stat :  error
                        { resynch();
                          yyerrok ;
                          yyclearin ; }
                ;

These mechanisms are admittedly crude, but do allow for a simple, fairly effective 
recovery of the parser from many errors; moreover, the user can get control to deal with the 
error actions required by other portions of the program. 

8: The Yacc Environment 

When the user inputs a specification to Yacc, the output is a file of C programs, called
y.tab.c on most systems (due to local file system conventions, the names may differ from
installation to installation). The function produced by Yacc is called yyparse; it is an integer valued
function. When it is called, it in turn repeatedly calls yylex, the lexical analyzer supplied by the 
user (see Section 3) to obtain input tokens. Eventually, either an error is detected, in which 
case (if no error recovery is possible) yyparse returns the value 1, or the lexical analyzer 
returns the endmarker token and the parser accepts. In this case, yyparse returns the value 0. 

The user must provide a certain amount of environment for this parser in order to obtain 
a working program. For example, as with every C program, a program called main must be 
defined, that eventually calls yyparse. In addition, a routine called yyerror prints a message 
when a syntax error is detected. 

These two routines must be supplied in one form or another by the user. To ease the
initial effort of using Yacc, a library has been provided with default versions of main and yyerror.
The name of this library is system dependent; on many systems the library is accessed by a 
-ly argument to the loader. To show the triviality of these default programs, the source is 
given below: 

        main(){
                return( yyparse() );
                }

and 

        # include <stdio.h>

        yyerror(s) char *s; {
                fprintf( stderr, "%s\n", s );
                }

The argument to yyerror is a string containing an error message, usually the string “syntax 
error”. The average application will want to do better than this. Ordinarily, the program 
should keep track of the input line number, and print it along with the message when a syntax 
error is detected. The external integer variable yychar contains the lookahead token number at 
the time the error was detected; this may be of some interest in giving better diagnostics. 
Since the main program is probably supplied by the user (to read arguments, etc.) the Yacc
library is useful only in small projects, or in the earliest stages of larger ones.

The external integer variable yydebug is normally set to 0. If it is set to a nonzero value,
the parser will output a verbose description of its actions, including a discussion of which input 
symbols have been read, and what the parser actions are. Depending on the operating environ¬ 
ment, it may be possible to set this variable by using a debugging system. 
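
The advice above about doing better than the default yyerror can be sketched as follows. This is only an illustration under stated assumptions: the variable lineno and the helper format_error are hypothetical names, and yychar is defined here merely so that the fragment stands alone (in a real program the Yacc-generated parser defines it, and the lexical analyzer would advance lineno at each newline).

```c
#include <stdio.h>

/* Stand-ins for this sketch only: the generated parser normally defines
   yychar, and the lexical analyzer would increment lineno on each newline. */
int yychar = 0;
int lineno = 1;

/* Build the diagnostic text in buf; kept separate so it can be checked. */
int format_error(char *buf, size_t size, const char *msg)
{
    return snprintf(buf, size, "line %d: %s (lookahead token %d)",
                    lineno, msg, yychar);
}

/* User-supplied replacement for the library version of yyerror. */
int yyerror(const char *s)
{
    char buf[256];

    format_error(buf, sizeof buf, s);
    fprintf(stderr, "%s\n", buf);
    return 0;
}
```

Printing the raw token number is crude; an application would more likely translate it to a token name before showing it to the user.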

9: Hints for Preparing Specifications 

This section contains miscellaneous hints on preparing efficient, easy to change, and clear 
specifications. The individual subsections are more or less independent. 

Input Style 

It is difficult to provide rules with substantial actions and still have a readable 
specification file. The following style hints owe much to Brian Kernighan. 

a. Use all capital letters for token names, all lower case letters for nonterminal names. This 
rule comes under the heading of “knowing who to blame when things go wrong.” 

b. Put grammar rules and actions on separate lines. This allows either to be changed 
without an automatic need to change the other. 

c. Put all rules with the same left hand side together. Put the left hand side in only once, 
and let all following rules begin with a vertical bar. 

d. Put a semicolon only after the last rule with a given left hand side, and put the semicolon 
on a separate line. This allows new rules to be easily added. 

e. Indent rule bodies by two tab stops, and action bodies by three tab stops. 

The example in Appendix A is written following this style, as are the examples in the text
of this paper (where space permits). The user must make up his own mind about these
stylistic questions; the central problem, however, is to make the rules visible through the morass of
action code.

Left Recursion 

The algorithm used by the Yacc parser encourages so called “left recursive” grammar 
rules: rules of the form 

name: name rest_of_rule ; 

These rules frequently arise when writing specifications of sequences and lists: 

        list :  item
             |  list ',' item
             ;

and 

        seq  :  item
             |  seq item
             ;


In each of these cases, the first rule will be reduced for the first item only, and the second rule 
will be reduced for the second and all succeeding items. 

With right recursive rules, such as 

        seq  :  item
             |  item seq
             ;

the parser would be a bit bigger, and the items would be seen, and reduced, from right to left. 
More seriously, an internal stack in the parser would be in danger of overflowing if a very long
sequence were read. Thus, the user should use left recursion wherever reasonable.

It is worth considering whether a sequence with zero elements has any meaning, and if
so, consider writing the sequence specification with an empty rule: 

        seq  :  /* empty */
             |  seq item
             ;

Once again, the first rule would always be reduced exactly once, before the first item was read, 
and then the second rule would be reduced once for each item read. Permitting empty 
sequences often leads to increased generality. However, conflicts might arise if Yacc is asked 
to decide which empty sequence it has seen, when it hasn’t seen enough to know! 
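
The stack argument made earlier in this subsection for preferring left recursion can be checked with a small simulation. The sketch below is a simplified model, not the Yacc parser itself: it counts only grammar symbols on the stack, ignores the endmarker, and the function names are invented for the illustration.

```c
/* Maximum number of grammar symbols on the parse stack while reading n
   items with the left recursive rules   seq : item | seq item ;          */
int max_depth_left(int n)
{
    int depth = 0, max = 0, i;

    for (i = 0; i < n; i++) {
        depth++;                    /* shift item */
        if (depth > max)
            max = depth;
        depth = 1;                  /* reduce "item" or "seq item" to seq */
    }
    return max;
}

/* The same for the right recursive rules   seq : item | item seq ;
   nothing can be reduced until the last item has been shifted.           */
int max_depth_right(int n)
{
    int depth = 0, max = 0, i;

    for (i = 0; i < n; i++) {
        depth++;                    /* shift item */
        if (depth > max)
            max = depth;
    }
    /* the last item is reduced to seq (depth unchanged); each later
       "item seq" reduction pops two symbols and pushes one */
    while (depth > 1)
        depth--;
    return max;
}
```

In this model the left recursive form never holds more than two symbols, while the right recursive form holds one symbol per item read, which is the overflow risk the text describes.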

Lexical Tie-ins 

Some lexical decisions depend on context. For example, the lexical analyzer might want 
to delete blanks normally, but not within quoted strings. Or names might be entered into a 
symbol table in declarations, but not in expressions. 

One way of handling this situation is to create a global flag that is examined by the
lexical analyzer, and set by actions. For example, suppose a program consists of 0 or more
declarations, followed by 0 or more statements. Consider:

        %{
                int dflag;
        %}
          ... other declarations ...

        %%

        prog  : decls stats
              ;

        decls : /* empty */
                        { dflag = 1; }
              | decls declaration
              ;

        stats : /* empty */
                        { dflag = 0; }
              | stats statement
              ;

          ... other rules ...

The flag dflag is now 0 when reading statements, and 1 when reading declarations, except for
the first token in the first statement. This token must be seen by the parser before it can tell
that the declaration section has ended and the statements have begun. In many cases, this
single token exception does not affect the lexical scan.

This kind of “backdoor” approach can be elaborated to a noxious degree. Nevertheless, it 
represents a way of doing some things that are difficult, if not impossible, to do otherwise. 
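
The lexical analyzer's side of the dflag tie-in might look roughly like the sketch below. Everything apart from dflag is an assumption made for the illustration: the toy symbol table, its size limits, and the names lookup and see_name are not part of Yacc.

```c
#include <string.h>

#define MAXSYMS 100

int dflag;                          /* set and cleared by grammar actions */

static char symtab[MAXSYMS][32];    /* toy symbol table for this sketch */
static int nsyms;

/* Return the index of name in the table, or -1 if it is absent. */
int lookup(const char *name)
{
    int i;

    for (i = 0; i < nsyms; i++)
        if (strcmp(symtab[i], name) == 0)
            return i;
    return -1;
}

/* Called by the lexical analyzer for each name it reads.  In declarations
   (dflag nonzero) unknown names are entered; in statements they are only
   looked up, so an undeclared name yields -1. */
int see_name(const char *name)
{
    int i = lookup(name);

    if (i < 0 && dflag && nsyms < MAXSYMS) {
        strncpy(symtab[nsyms], name, sizeof symtab[0] - 1);
        symtab[nsyms][sizeof symtab[0] - 1] = '\0';
        i = nsyms++;
    }
    return i;
}
```

Because of the single token exception noted above, the first name of the first statement would still be read with dflag set to 1; a real analyzer would have to tolerate that.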

Reserved Words 

Some programming languages permit the user to use words like “if”, which are normally
reserved, as label or variable names, provided that such use does not conflict with the legal use
of these names in the programming language. This is extremely hard to do in the framework
of Yacc; it is difficult to pass information to the lexical analyzer telling it “this instance of ‘if’
is a keyword, and that instance is a variable”. The user can make a stab at it, using the
mechanism described in the last subsection, but it is difficult.

A number of ways of making this easier are under advisement. Until then, it is better
that the keywords be reserved; that is, be forbidden for use as variable names. There are
powerful stylistic reasons for preferring this, anyway.


10: Advanced Topics 

This section discusses a number of advanced features of Yacc. 

Simulating Error and Accept in Actions 

The parsing actions of error and accept can be simulated in an action by use of macros 
YYACCEPT and YYERROR. YYACCEPT causes yyparse to return the value 0; YYERROR 
causes the parser to behave as if the current input symbol had been a syntax error; yyerror is 
called, and error recovery takes place. These mechanisms can be used to simulate parsers with 
multiple endmarkers or context-sensitive syntax checking. 

Accessing Values in Enclosing Rules. 

An action may refer to values returned by actions to the left of the current rule. The 
mechanism is simply the same as with ordinary actions, a dollar sign followed by a digit, but in 
this case the digit may be 0 or negative. Consider 

        sent :  adj noun verb adj noun
                        { look at the sentence . . . }
             ;

        adj  :  THE
                        { $$ = THE; }
             |  YOUNG
                        { $$ = YOUNG; }
             ;

        noun :  DOG
                        { $$ = DOG; }
             |  CRONE
                        { if( $0 == YOUNG ){
                                printf( "what?\n" );
                                }
                          $$ = CRONE;
                        }
             ;


In the action following the word CRONE, a check is made that the preceding token shifted was 
not YOUNG. Obviously, this is only possible when a great deal is known about what might 
precede the symbol noun in the input. There is also a distinctly unstructured flavor about this. 
Nevertheless, at times this mechanism will save a great deal of trouble, especially when a few 
combinations are to be excluded from an otherwise regular structure. 

Support for Arbitrary Value Types 

By default, the values returned by actions and the lexical analyzer are integers. Yacc can 
also support values of other types, including structures. In addition, Yacc keeps track of the 
types, and inserts appropriate union member names so that the resulting parser will be strictly 
type checked. The Yacc value stack (see Section 4) is declared to be a union of the various 
types of values desired. The user declares the union, and associates union member names to 
each token and nonterminal symbol having a value. When the value is referenced through a $$ 
or $n construction, Yacc will automatically insert the appropriate union name, so that no 
unwanted conversions will take place. In addition, type checking commands such as Lint will
be far more silent.

There are three mechanisms used to provide for this typing. First, there is a way of
defining the union; this must be done by the user since other programs, notably the lexical
analyzer, must know about the union member names. Second, there is a way of associating a
union member name with tokens and nonterminals. Finally, there is a mechanism for
describing the type of those few values where Yacc cannot easily determine the type.

To declare the union, the user includes in the declaration section: 

        %union {
                body of union ...
                }

This declares the Yacc value stack, and the external variables yylval and yyval, to have type 
equal to this union. If Yacc was invoked with the -d option, the union declaration is copied 
onto the y.tab.h file. Alternatively, the union may be declared in a header file, and a typedef 
used to define the variable YYSTYPE to represent this union. Thus, the header file might also 
have said: 

        typedef union {
                body of union ...
                } YYSTYPE;

The header file must be included in the declarations section, by use of %{ and %}.

Once YYSTYPE is defined, the union member names must be associated with the various 
terminal and nonterminal names. The construction 

< name > 

is used to indicate a union member name. If this follows one of the keywords %token, %left, 
%right, and %nonassoc, the union member name is associated with the tokens listed. Thus, 
saying 

        %left <optype> '+' '-'

will cause any reference to values returned by these two tokens to be tagged with the union 
member name optype. Another keyword, %type, is used similarly to associate union member 
names with nonterminals. Thus, one might say 

%type <nodetype> expr stat 

There remain a couple of cases where these mechanisms are insufficient. If there is an 
action within a rule, the value returned by this action has no a priori type. Similarly, reference 
to left context values (such as $0 - see the previous subsection ) leaves Yacc with no easy way 
of knowing the type. In this case, a type can be imposed on the reference by inserting a union 
member name, between < and >, immediately after the first $. An example of this usage is 

        rule :  aaa { $<intval>$ = 3; } bbb
                        { fun( $<intval>2, $<other>0 ); }
             ;

This syntax has little to recommend it, but the situation arises rarely. 
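
In plain C terms, the typing mechanism amounts to selecting one member of a union. The sketch below is an illustration only: yylval, yyval and the member name intval come from the text, while the other members, the PAIR structure and the two helper functions are hypothetical.

```c
typedef struct { double lo, hi; } PAIR;   /* hypothetical structure value */

typedef union {
    int    intval;
    double dval;
    PAIR   pval;
} YYSTYPE;

YYSTYPE yylval;          /* set by the lexical analyzer */
YYSTYPE yyval;           /* what $$ refers to inside an action */

/* Roughly what an action such as  { $<intval>$ = 3; }  amounts to once
   the union member name has been inserted. */
void set_int_result(int n)
{
    yyval.intval = n;
}

/* A lexical analyzer returning a floating point token would store the
   value in the matching member before returning the token number. */
void set_token_value(double d)
{
    yylval.dval = d;
}
```

The point of the tags is that each reference names exactly one member, so the compiler never has to convert between the union's alternatives.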

A sample specification is given in Appendix C. The facilities in this subsection are not 
triggered until they are used: in particular, the use of %type will turn on these mechanisms. 
When they are used, there is a fairly strict level of checking. For example, use of $n or $$ to 
refer to something with no defined type is diagnosed. If these facilities are not triggered, the
Yacc value stack is used to hold int's, as was true historically.

Appendix A: A Simple Example

This example gives the complete Yacc specification for a small desk calculator; the desk
calculator has 26 registers, labeled “a” through “z”, and accepts arithmetic expressions made
up of the operators +, -, *, /, % (mod operator), & (bitwise and), | (bitwise or), and
assignment. If an expression at the top level is an assignment, the value is not printed; otherwise it
is. As in C, an integer that begins with 0 (zero) is assumed to be octal; otherwise, it is assumed
to be decimal.

As an example of a Yacc specification, the desk calculator does a reasonable job of
showing how precedences and ambiguities are used, and demonstrating simple error recovery. The
major oversimplifications are that the lexical analysis phase is much simpler than for most
applications, and the output is produced immediately, line by line. Note the way that decimal
and octal integers are read in by the grammar rules; this job is probably better done by the
lexical analyzer. 
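
If the conversion of integers were moved into the lexical analyzer, as suggested, it might be done by a routine along these lines. This is a sketch rather than part of the calculator: the name scan_number and its string argument are assumptions, and it follows the same rule as the grammar (a leading 0 selects base 8, with no check that later digits are valid octal digits).

```c
#include <ctype.h>

/* Convert the digit string at s, using the calculator's convention that a
   leading 0 means octal and anything else decimal.  Conversion stops at
   the first character that is not a digit. */
int scan_number(const char *s)
{
    int base, value = 0;

    base = (*s == '0') ? 8 : 10;
    while (isdigit((unsigned char)*s)) {
        value = base * value + (*s - '0');
        s++;
    }
    return value;
}
```

With such a routine the lexical analyzer could return a single token carrying the converted value, and the number rules would disappear from the grammar.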


%{
# include <stdio.h>
# include <ctype.h>

int regs[26];
int base;

%}

%start list

%token DIGIT LETTER

%left '|'
%left '&'
%left '+' '-'
%left '*' '/' '%'
%left UMINUS      /* supplies precedence for unary minus */

%%      /* beginning of rules section */

list :  /* empty */
     |  list stat '\n'
     |  list error '\n'
                { yyerrok; }
     ;

stat :  expr
                { printf( "%d\n", $1 ); }
     |  LETTER '=' expr
                { regs[$1] = $3; }
     ;

expr :  '(' expr ')'
                { $$ = $2; }
     |  expr '+' expr
                { $$ = $1 + $3; }
     |  expr '-' expr
                { $$ = $1 - $3; }
     |  expr '*' expr
                { $$ = $1 * $3; }
     |  expr '/' expr
                { $$ = $1 / $3; }
     |  expr '%' expr
                { $$ = $1 % $3; }
     |  expr '&' expr
                { $$ = $1 & $3; }
     |  expr '|' expr
                { $$ = $1 | $3; }
     |  '-' expr %prec UMINUS
                { $$ = - $2; }
     |  LETTER
                { $$ = regs[$1]; }
     |  number
     ;

number : DIGIT
                { $$ = $1; base = ($1==0) ? 8 : 10; }
     |  number DIGIT
                { $$ = base * $1 + $2; }
     ;

%%      /* start of programs */

yylex() {       /* lexical analysis routine */
        /* returns LETTER for a lower case letter, yylval = 0 through 25 */
        /* return DIGIT for a digit, yylval = 0 through 9 */
        /* all other characters are returned immediately */

        int c;

        while( (c=getchar()) == ' ' ) { /* skip blanks */ }

        /* c is now nonblank */

        if( islower( c ) ) {
                yylval = c - 'a';
                return ( LETTER );
                }
        if( isdigit( c ) ) {
                yylval = c - '0';
                return( DIGIT );
                }
        return( c );
        }

Appendix B: Yacc Input Syntax

This Appendix has a description of the Yacc input syntax, as a Yacc specification.
Context dependencies, etc., are not considered. Ironically, the Yacc input specification language is
most naturally specified as an LR(2) grammar; the sticky part comes when an identifier is seen
in a rule, immediately following an action. If this identifier is followed by a colon, it is the
start of the next rule; otherwise it is a continuation of the current rule, which just happens to
have an action embedded in it. As implemented, the lexical analyzer looks ahead after seeing
an identifier, and decides whether the next token (skipping blanks, newlines, comments, etc.) is
a colon. If so, it returns the token C_IDENTIFIER. Otherwise, it returns IDENTIFIER.
Literals (quoted strings) are also returned as IDENTIFIERS, but never as part of
C_IDENTIFIERS.
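
The lookahead just described can be sketched in C. This is not Yacc's own source: the function name, the string argument standing in for the input stream, and the token numbers are all assumptions made so that the fragment is self-contained.

```c
#include <ctype.h>

/* Token codes for this sketch only; the real values are assigned by Yacc. */
enum { IDENTIFIER = 257, C_IDENTIFIER = 258 };

/* s points just past an identifier.  Skip blanks, tabs, newlines and
   C-style comments, then report whether the next character is a colon. */
int classify_identifier(const char *s)
{
    for (;;) {
        while (isspace((unsigned char)*s))
            s++;
        if (s[0] == '/' && s[1] == '*') {
            s += 2;
            while (*s && !(s[0] == '*' && s[1] == '/'))
                s++;
            if (*s)
                s += 2;             /* step over the closing delimiter */
            continue;
        }
        break;
    }
    return (*s == ':') ? C_IDENTIFIER : IDENTIFIER;
}
```

Doing this extra scan in the lexical analyzer is what lets the grammar below remain LR(1) even though the language is most naturally LR(2).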


        /* grammar for the input to Yacc */

        /* basic entities */
%token  IDENTIFIER      /* includes identifiers and literals */
%token  C_IDENTIFIER    /* identifier (but not literal) followed by colon */
%token  NUMBER          /* [0-9]+ */

        /* reserved words: %type => TYPE, %left => LEFT, etc. */

%token  LEFT RIGHT NONASSOC TOKEN PREC TYPE START UNION

%token  MARK            /* the %% mark */
%token  LCURL           /* the %{ mark */
%token  RCURL           /* the %} mark */

        /* ascii character literals stand for themselves */

%start  spec

%%

spec  : defs MARK rules tail
      ;

tail  : MARK    { In this action, eat up the rest of the file }
      | /* empty: the second MARK is optional */
      ;

defs  : /* empty */
      | defs def
      ;

def   : START IDENTIFIER
      | UNION   { Copy union definition to output }
      | LCURL   { Copy C code to output file } RCURL
      | ndefs rword tag nlist
      ;

rword : TOKEN
      | LEFT
      | RIGHT
      | NONASSOC
      | TYPE
      ;

tag   : /* empty: union tag is optional */
      | '<' IDENTIFIER '>'
      ;

nlist : nmno
      | nlist nmno
      | nlist ',' nmno
      ;

nmno  : IDENTIFIER              /* NOTE: literal illegal with %type */
      | IDENTIFIER NUMBER       /* NOTE: illegal with %type */
      ;

        /* rules section */

rules : C_IDENTIFIER rbody prec
      | rules rule
      ;

rule  : C_IDENTIFIER rbody prec
      | '|' rbody prec
      ;

rbody : /* empty */
      | rbody IDENTIFIER
      | rbody act
      ;

act   : '{' { Copy action, translate $$, etc. } '}'
      ;

prec  : /* empty */
      | PREC IDENTIFIER
      | PREC IDENTIFIER act
      | prec ';'
      ;

Appendix C: An Advanced Example

This Appendix gives an example of a grammar using some of the advanced features
discussed in Section 10. The desk calculator example in Appendix A is modified to provide a desk
calculator that does floating point interval arithmetic. The calculator understands floating
point constants, the arithmetic operations +, -, *, /, unary -, and = (assignment), and has
floating point variables, “a” through “z”. Moreover, it also understands intervals, written

( x , y ) 

where x is less than or equal to y. There are 26 interval valued variables “A” through “Z” that 
may also be used. The usage is similar to that in Appendix A; assignments return no value, 
and print nothing, while expressions print the (floating or interval) value. 

This example explores a number of interesting features of Yacc and C. Intervals are
represented by a structure, consisting of the left and right endpoint values, stored as doubles.
This structure is given a type name, INTERVAL, by using typedef. The Yacc value stack can
also contain floating point scalars, and integers (used to index into the arrays holding the
variable values). Notice that this entire strategy depends strongly on being able to assign
structures and unions in C. In fact, many of the actions call functions that return structures as
well.

It is also worth noting the use of YYERROR to handle error conditions: division by an 
interval containing 0, and an interval presented in the wrong order. In effect, the error 
recovery mechanism of Yacc is used to throw away the rest of the offending line. 

In addition to the mixing of types on the value stack, this grammar also demonstrates an 
interesting use of syntax to keep track of the type (e.g. scalar or interval) of intermediate 
expressions. Note that a scalar can be automatically promoted to an interval if the context 
demands an interval value. This causes a large number of conflicts when the grammar is run 
through Yacc: 18 Shift/Reduce and 26 Reduce/Reduce. The problem can be seen by looking at 
the two input lines: 

2.5 + ( 3.5 - 4. ) 

and 

2.5 + ( 3.5,4. ) 

Notice that the 2.5 is to be used in an interval valued expression in the second example, but
this fact is not known until the ',' is read; by this time, 2.5 is finished, and the parser cannot
go back and change its mind. More generally, it might be necessary to look ahead an arbitrary 
number of tokens to decide whether to convert a scalar to an interval. This problem is evaded 
by having two rules for each binary interval valued operator: one when the left operand is a 
scalar and one when the left operand is an interval. In the second case, the right operand 
must be an interval, so the conversion will be applied automatically. Despite this evasion, 
there are still many cases where the conversion may be applied or not, leading to the above 
conflicts. They are resolved by listing the rules that yield scalars first in the specification file; 
in this way, the conflicts will be resolved in the direction of keeping scalar valued expressions 
scalar valued until they are forced to become intervals. 

This way of handling multiple types is very instructive, but not very general. If there 
were many kinds of expression types, instead of just two, the number of rules needed would 
increase dramatically, and the conflicts even more dramatically. Thus, while this example is 
instructive, it is better practice in a more normal programming language environment to keep 
the type information as part of the value, and not as part of the grammar. 

Finally, a word about the lexical analysis. The only unusual feature is the treatment of 
floating point constants. The C library routine atof is used to do the actual conversion from a 
character string to a double precision value. If the lexical analyzer detects an error, it responds 
by returning a token that is illegal in the grammar, provoking a syntax error in the parser, and 
thence error recovery.

%{

# include <stdio.h>
# include <ctype.h>

typedef struct interval {
        double lo, hi;
        } INTERVAL;

INTERVAL vmul(), vdiv();

double atof();

double dreg[ 26 ];
INTERVAL vreg[ 26 ];

%}

%start  lines

%union  {
        int ival;
        double dval;
        INTERVAL vval;
        }

%token <ival> DREG VREG /* indices into dreg, vreg arrays */
%token <dval> CONST     /* floating point constant */

%type <dval> dexp       /* expression */

%type <vval> vexp       /* interval expression */

        /* precedence information about the operators */

%left   '+' '-'
%left   '*' '/'
%left   UMINUS          /* precedence for unary minus */

%%

lines : /* empty */
      | lines line
      ;

line  : dexp '\n'
                { printf( "%15.8f\n", $1 ); }
      | vexp '\n'
                { printf( "(%15.8f , %15.8f )\n", $1.lo, $1.hi ); }
      | DREG '=' dexp '\n'
                { dreg[$1] = $3; }
      | VREG '=' vexp '\n'
                { vreg[$1] = $3; }
      | error '\n'
                { yyerrok; }
      ;

dexp  : CONST
      | DREG
                { $$ = dreg[$1]; }
      | dexp '+' dexp
                { $$ = $1 + $3; }
      | dexp '-' dexp
                { $$ = $1 - $3; }
      | dexp '*' dexp
                { $$ = $1 * $3; }
      | dexp '/' dexp
                { $$ = $1 / $3; }
      | '-' dexp %prec UMINUS
                { $$ = - $2; }
      | '(' dexp ')'
                { $$ = $2; }
      ;

vexp  : dexp
                { $$.hi = $$.lo = $1; }
      | '(' dexp ',' dexp ')'
                {
                $$.lo = $2;
                $$.hi = $4;
                if( $$.lo > $$.hi ){
                        printf( "interval out of order\n" );
                        YYERROR;
                        }
                }
      | VREG
                { $$ = vreg[$1]; }
      | vexp '+' vexp
                { $$.hi = $1.hi + $3.hi;
                  $$.lo = $1.lo + $3.lo; }
      | dexp '+' vexp
                { $$.hi = $1 + $3.hi;
                  $$.lo = $1 + $3.lo; }
      | vexp '-' vexp
                { $$.hi = $1.hi - $3.lo;
                  $$.lo = $1.lo - $3.hi; }
      | dexp '-' vexp
                { $$.hi = $1 - $3.lo;
                  $$.lo = $1 - $3.hi; }
      | vexp '*' vexp
                { $$ = vmul( $1.lo, $1.hi, $3 ); }
      | dexp '*' vexp
                { $$ = vmul( $1, $1, $3 ); }
      | vexp '/' vexp
                { if( dcheck( $3 ) ) YYERROR;
                  $$ = vdiv( $1.lo, $1.hi, $3 ); }
      | dexp '/' vexp
                { if( dcheck( $3 ) ) YYERROR;
                  $$ = vdiv( $1, $1, $3 ); }
      | '-' vexp %prec UMINUS
                { $$.hi = -$2.lo; $$.lo = -$2.hi; }
      | '(' vexp ')'
                { $$ = $2; }
      ;

%%

# define BSZ 50         /* buffer size for floating point numbers */

        /* lexical analysis */

yylex(){
        register c;

        while( (c=getchar()) == ' ' ){ /* skip over blanks */ }

        if( isupper( c ) ){
                yylval.ival = c - 'A';
                return( VREG );
                }
        if( islower( c ) ){
                yylval.ival = c - 'a';
                return( DREG );
                }

        if( isdigit( c ) || c=='.' ){
                /* gobble up digits, points, exponents */

                char buf[BSZ+1], *cp = buf;
                int dot = 0, exp = 0;

                for( ; (cp-buf)<BSZ ; ++cp,c=getchar() ){

                        *cp = c;
                        if( isdigit( c ) ) continue;
                        if( c == '.' ){
                                if( dot++ || exp ) return( '.' );  /* will cause syntax error */
                                continue;
                                }

                        if( c == 'e' ){
                                if( exp++ ) return( 'e' );  /* will cause syntax error */
                                continue;
                                }

                        /* end of number */
                        break;
                        }
                *cp = '\0';
                if( (cp-buf) >= BSZ ) printf( "constant too long: truncated\n" );
                else ungetc( c, stdin );  /* push back last char read */
                yylval.dval = atof( buf );
                return( CONST );
                }
        return( c );
        }

INTERVAL hilo( a, b, c, d ) double a, b, c, d; {
        /* returns the smallest interval containing a, b, c, and d */
        /* used by *, / routines */
        INTERVAL v;

        if( a>b ) { v.hi = a; v.lo = b; }
        else { v.hi = b; v.lo = a; }

        if( c>d ) {
                if( c>v.hi ) v.hi = c;
                if( d<v.lo ) v.lo = d;
                }
        else {
                if( d>v.hi ) v.hi = d;
                if( c<v.lo ) v.lo = c;
                }
        return( v );
        }

INTERVAL vmul( a, b, v ) double a, b; INTERVAL v; {
        return( hilo( a*v.hi, a*v.lo, b*v.hi, b*v.lo ) );
        }

dcheck( v ) INTERVAL v; {
        if( v.hi >= 0. && v.lo <= 0. ){
                printf( "divisor interval contains 0.\n" );
                return( 1 );
                }
        return( 0 );
        }

INTERVAL vdiv( a, b, v ) double a, b; INTERVAL v; {
        return( hilo( a/v.hi, a/v.lo, b/v.hi, b/v.lo ) );
        }

Appendix D: Old Features Supported but not Encouraged

This Appendix mentions synonyms and features which are supported for historical
continuity, but, for various reasons, are not encouraged.

1. Literals may also be delimited by double quotes.

2. Literals may be more than one character long. If all the characters are alphabetic,
   numeric, or _, the type number of the literal is defined, just as if the literal did not have
   the quotes around it. Otherwise, it is difficult to find the value for such literals.

   The use of multi-character literals is likely to mislead those unfamiliar with Yacc, since it
   suggests that Yacc is doing a job which must be actually done by the lexical analyzer.

3. Most places where % is legal, backslash “\” may be used. In particular, \\ is the same as
   %%, \left the same as %left, etc.

4. There are a number of other synonyms:

        %< is the same as %left
        %> is the same as %right
        %binary and %2 are the same as %nonassoc
        %0 and %term are the same as %token
        %= is the same as %prec

5. Actions may also have the form

        ={ . . . }

   and the curly braces can be dropped if the action is a single C statement.

6. C code between %{ and %} used to be permitted at the head of the rules section, as well
   as in the declaration section.

Lex - A Lexical Analyzer Generator

M. E. Lesk and E. Schmidt 


ABSTRACT 


Lex helps write programs whose control flow is directed by instances of 
regular expressions in the input stream. It is well suited for editor-script type 
transformations and for segmenting input in preparation for a parsing routine. 

Lex source is a table of regular expressions and corresponding program
fragments. The table is translated to a program which reads an input stream,
copying it to an output stream and partitioning the input into strings which
match the given expressions. As each such string is recognized the corresponding
program fragment is executed. The recognition of the expressions is performed
by a deterministic finite automaton generated by Lex. The program
fragments written by the user are executed in the order in which the
corresponding regular expressions occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous 
specifications and choose the longest match possible at each input point. If 
necessary, substantial lookahead is performed on the input, but the input 
stream will be backed up to the end of the current partition, so that the user 
has general freedom to manipulate it. 

Lex can generate analyzers in either C or Ratfor, a language which can be 
translated automatically to portable Fortran. It is available on the PDP-11 
UNIX, Honeywell GCOS, and IBM OS systems. This manual, however, will 
only discuss generating analyzers in C on the UNIX system, which is the only 
supported form of Lex under UNIX Version 7. Lex is designed to simplify 
interfacing with Yacc, for those with access to this compiler-compiler system. 


July 21, 1975

1. Introduction.

Lex is a program generator designed for lexical processing of character input streams. It
accepts a high-level, problem oriented specification for character string matching, and
produces a program in a general purpose language which recognizes regular expressions. The
regular expressions are specified by the user in the source specifications given to Lex. The
Lex written code recognizes these expressions in an input stream and partitions the input
stream into strings matching the expressions. At the boundaries between strings program
sections provided by the user are executed. The Lex source file associates the regular
expressions and the program fragments. As each expression appears in the input to the
program written by Lex, the corresponding fragment is executed.

The user supplies the additional code beyond expression matching needed to complete his
tasks, possibly including code written by other generators. The program that recognizes the
expressions is generated in the general purpose programming language employed for the
user's program fragments. Thus, a high level expression language is provided to write the
string expressions to be matched while the user's freedom to write actions is unimpaired.
This avoids forcing the user who wishes to use a string manipulation language for input
analysis to write processing programs in the same and often inappropriate string handling
language.

Lex is not a complete language, but rather a generator representing a new language feature
which can be added to different programming languages, called “host languages.” Just as
general purpose languages can produce code to run on different computer hardware, Lex can
write code in different host languages. The host language is used for the output code
generated by Lex and also for the program fragments added by the user. Compatible run-time
libraries for the different host languages are also provided. This makes Lex adaptable to
different environments and different users. Each application may be directed to the
combination of hardware and host language appropriate to the task, the user’s background,
and the properties of local implementations. At present, the only supported host language is
C, although Fortran (in the form of Ratfor [2]) has been available in the past. Lex itself
exists on UNIX, GCOS, and OS/370; but the code generated by Lex may be taken anywhere
the appropriate compilers exist.

Lex turns the user’s expressions and actions (called source in this memo) into the host
general-purpose language; the generated program is named yylex. The yylex program will
recognize expressions in a stream (called input in this memo) and perform the specified
actions for each expression as it is detected. See Figure 1.


        Source -->  Lex    -->  yylex

        Input  -->  yylex  -->  Output

        An overview of Lex

        Figure 1

For a trivial example, consider a program to delete from the input all blanks or
tabs at the ends of lines.

%% 

[ \t]+$ ; 

is all that is required. The program con¬ 
tains a %% delimiter to mark the beginning 
of the rules, and one rule. This rule con¬ 
tains a regular expression which matches 
one or more instances of the characters 
blank or tab (written \t for visibility, in 
accordance with the C language convention) 
just prior to the end of a line. The brackets 
indicate the character class made of blank 
and tab; the + indicates “one or more,” 
and the $ indicates “end of line,” as in 
QED. No action is specified, so the pro¬ 
gram generated by Lex (yylex) will ignore 
these characters. Everything else will be 
copied. To change any remaining string of 
blanks or tabs to a single blank, add another 
rule: 

%% 

[ \t]+$ ; 

[ \t]+ printf(" "); 

The finite automaton generated for this 
source will scan for both rules at once, 
observing at the termination of the string of 
blanks or tabs whether or not there is a 
newline character, and executing the desired 
rule action. The first rule matches all 
strings of blanks or tabs at the end of lines, 
and the second rule all remaining strings of 
blanks or tabs. 
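
For comparison, the effect of these two rules can be sketched by hand in C. This is only an illustration of what the generated scanner does to the text, not the code Lex produces; the function name squeeze_blanks is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Apply the two rules to one buffer:
 *   [ \t]+$   delete blanks/tabs that are followed by a newline
 *   [ \t]+    replace any other run of blanks/tabs with one blank
 * dst must have room for as many bytes as src, including the final 0.
 * Returns the number of characters written (not counting the 0). */
size_t squeeze_blanks(const char *src, char *dst)
{
    char *start = dst;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (*src == ' ' || *src == '\t') {
            /* consume the whole run, as the longest match would */
            while (*src == ' ' || *src == '\t')
                src++;
            /* the run is "at end of line" only if a newline follows */
            if (*src != '\n')
                *dst++ = ' ';
        } else {
            *dst++ = *src++;        /* default action: copy */
        }
    }
    *dst = '\0';
    return (size_t)(dst - start);
}
```

The hand-written version has to look past the run of blanks before it knows which rule applies, which is the same decision the finite automaton makes when it reaches the end of the run.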


Lex can be used alone for simple 
transformations, or for analysis and statis¬ 
tics gathering on a lexical level. Lex can 
also be used with a parser generator to per¬ 
form the lexical analysis phase; it is particu¬ 
larly easy to interface Lex and Yacc [3]. 
Lex programs recognize only regular expres¬ 
sions; Yacc writes parsers that accept a 
large class of context free grammars, but 
require a lower level analyzer to recognize 
input tokens. Thus, a combination of Lex 
and Yacc is often appropriate. When used 
as a preprocessor for a later parser genera¬ 
tor, Lex is used to partition the input 
stream, and the parser generator assigns 
structure to the resulting pieces. The flow 
of control in such a case (which might be 
the first half of a compiler, for example) is 
shown in Figure 2. Additional programs, 
written by other generators or by hand, can 
be added easily to programs written by Lex. 

          lexical          grammar 
           rules            rules 
             |                | 
             v                v 
          [ Lex ]          [ Yacc ] 
             |                | 
             v                v 
Input --> [ yylex ] --> [ yyparse ] --> Parsed input 

                 Lex with Yacc 

                    Figure 2 

Yacc users will realize that the name yylex 
is what Yacc expects its lexical analyzer to 
be named, so that the use of this name by 
Lex simplifies interfacing. 

Lex generates a deterministic finite 
automaton from the regular expressions in 
the source [4]. The automaton is interpreted, 
rather than compiled, in order to 
save space. The result is still a fast 
analyzer. In particular, the time taken by a 
Lex program to recognize and partition an 
input stream is proportional to the length of 
the input. The number of Lex rules or the 
complexity of the rules is not important in 
determining speed, unless rules which 
include forward context require a significant 
amount of rescanning. What does increase 
with the number and complexity of rules is 
the size of the finite automaton, and there¬ 
fore the size of the program generated by 
Lex. 

In the program written by Lex, the 
user’s fragments (representing the actions to 
be performed as each regular expression is 
found) are gathered as cases of a switch. 
The automaton interpreter directs the con¬ 
trol flow. Opportunity is provided for the 
user to insert either declarations or addi¬ 
tional statements in the routine containing 
the actions, or to add subroutines outside 
this action routine. 

Lex is not limited to source which can 
be interpreted on the basis of one character 
lookahead. For example, if there are two 
rules, one looking for ab and another for 
abcdefg, and the input stream is abcdefh, 
Lex will recognize ab and leave the input 
pointer just before cdefh. Such backup is 
more costly than the processing of simpler 
languages. 

2. Lex Source. 

The general format of Lex source is: 
{definitions} 

%% 

{rules} 

%% 

{user subroutines} 

where the definitions and the user subrou¬ 
tines are often omitted. The second %% is 
optional, but the first is required to mark 
the beginning of the rules. The absolute 
minimum Lex program is thus 
%% 

(no definitions, no rules) which translates 
into a program which copies the input to 
the output unchanged. 

In the outline of Lex programs shown 
above, the rules represent the user’s control 
decisions; they are a table, in which the left 
column contains regular expressions (see 
section 3) and the right column contains 
actions, program fragments to be executed 
when the expressions are recognized. Thus 
an individual rule might appear 

integer printf("found keyword INT"); 
to look for the string integer in the input 
stream and print the message “found key¬ 
word INT” whenever it appears. In this 
example the host procedural language is C 
and the C library function printf is used to 
print the string. The end of the expression 
is indicated by the first blank or tab charac¬ 
ter. If the action is merely a single C 
expression, it can just be given on the right 
side of the line; if it is compound, or takes 
more than a line, it should be enclosed in 
braces. As a slightly more useful example, 
suppose it is desired to change a number of 
words from British to American spelling. 
Lex rules such as 

colour      printf("color"); 

mechanise   printf("mechanize"); 

petrol      printf("gas"); 

would be a start. These rules are not quite 
enough, since the word petroleum would 
become gaseum; a way of dealing with this 
will be described later. 

3. Lex Regular Expressions. 

The definitions of regular expressions 
are very similar to those in QED [5]. A reg¬ 
ular expression specifies a set of strings to 
be matched. It contains text characters 
(which match the corresponding characters 
in the strings being compared) and operator 
characters (which specify repetitions, 
choices, and other features). The letters of 
the alphabet and the digits are always text 
characters; thus the regular expression 
integer 

matches the string integer wherever it 
appears and the expression 
a57D 

looks for the string a57D. 

Operators. The operator characters 
are 

" \ [ ] ^ - ? . * + | ( ) $ / { } % < > 
and if they are to be used as text characters, 
an escape should be used. The quotation 
mark operator (") indicates that whatever is 
contained between a pair of quotes is to be 
taken as text characters. Thus 
xyz"++" 

matches the string xyz++ when it appears. 
Note that a part of a string may be quoted. 
It is harmless but unnecessary to quote an 
ordinary text character; the expression 
"xyz++" 

is the same as the one above. Thus by 
quoting every non-alphanumeric character 
being used as a text character, the user can 
avoid remembering the list above of current 
operator characters, and is safe should 
further extensions to Lex lengthen the list. 

An operator character may also be 
turned into a text character by preceding it 
with \ as in 

xyz\+\+ 

which is another, less readable, equivalent 
of the above expressions. Another use of 
the quoting mechanism is to get a blank 
into an expression; normally, as explained 
above, blanks or tabs end a rule. Any blank 
character not contained within [] (see 
below) must be quoted. Several normal C 
escapes with \ are recognized: \n is newline, 
\t is tab, and \b is backspace. To enter \ 
itself, use \\. Since newline is illegal in an 
expression, \n must be used; it is not 
required to escape tab and backspace. 
Every character but blank, tab, newline and 
the list above is always a text character. 

Character classes. Classes of charac¬ 
ters can be specified using the operator pair 
[]. The construction [abc] matches a single 
character, which may be a, b, or c. Within 
square brackets, most operator meanings 
are ignored. Only three characters are special: 
these are \, -, and ^. The - character 
indicates ranges. For example, 

[a-z0-9<>_] 

indicates the character class containing all 
the lower case letters, the digits, the angle 
brackets, and underline. Ranges may be 
given in either order. Using - between any 
pair of characters which are not both upper 
case letters, both lower case letters, or both 
digits is implementation dependent and will 
get a warning message. (E.g., [0-z] in 
ASCII is many more characters than it is in 
EBCDIC). If it is desired to include the 
character - in a character class, it should 
be first or last; thus 

[-+0-9] 

matches all the digits and the two signs. 

In character classes, the ^ operator 
must appear as the first character after the 
left bracket; it indicates that the resulting 
string is to be complemented with respect to 
the computer character set. Thus 
[^abc] 

matches all characters except a, b, or c, 
including all special or control characters; or 
[^a-zA-Z] 

is any character which is not a letter. The \ 
character provides the usual escapes within 
character class brackets. 

Arbitrary character. To match almost 
any character, the operator character 
. 
is the class of all characters except newline. 
Escaping into octal is possible although 
non-portable: 

[\40-\176] 

matches all printable characters in the 
ASCII character set, from octal 40 (blank) 
to octal 176 (tilde). 

Optional expressions. The operator ? 
indicates an optional element of an expres¬ 
sion. Thus 

ab?c 

matches either ac or abc. 

Repeated expressions. Repetitions of 
classes are indicated by the operators * and 
+. 

a* 

is any number of consecutive a characters, 
including zero; while 

a+ 

is one or more instances of a. For example, 
[a-z]+ 

is all strings of lower case letters. And 
[A-Za-z][A-Za-z0-9]* 
indicates all alphanumeric strings with a 
leading alphabetic character. This is a typi¬ 
cal expression for recognizing identifiers in 
computer languages. 

Alternation and Grouping. The 
operator | indicates alternation: 

(ab|cd) 

matches either ab or cd. Note that 
parentheses are used for grouping, although 
they are not necessary on the outside level; 
ab|cd 

would have sufficed. Parentheses can be 
used for more complex expressions: 

(ab|cd+)?(ef)* 

matches such strings as abefef, efefef, cdef, 
or cddd; but not abc, abcd, or abcdef. 
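
The claims about (ab|cd+)?(ef)* can be checked outside Lex. The sketch below assumes a system that provides the POSIX <regex.h> routines regcomp and regexec, whose extended syntax agrees with Lex for the operators used here; it is not part of Lex. The ^ and $ anchors are added only to force a whole-string test, and the function name matches_example is an assumption made for this sketch.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the whole of s matches (ab|cd+)?(ef)*, else 0.
 * Uses POSIX extended regular expressions, not Lex itself. */
int matches_example(const char *s)
{
    regex_t re;
    int rc;

    assert(s != NULL);
    rc = regcomp(&re, "^((ab|cd+)?(ef)*)$", REG_EXTENDED | REG_NOSUB);
    assert(rc == 0);            /* the pattern is a fixed constant */
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Calling matches_example on each of the seven strings listed above gives 1 for the first four and 0 for the last three.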

Context sensitivity. Lex will recognize 
a small amount of surrounding context. 
The two simplest operators for this are ^ 
and $. If the first character of an expression 
is ^, the expression will only be 
matched at the beginning of a line (after a 
newline character, or at the beginning of the 
input stream). This can never conflict with 
the other meaning of ^, complementation of 
character classes, since that only applies 
within the [] operators. If the very last 
character is $, the expression will only be 
matched at the end of a line (when immedi¬ 
ately followed by newline). The latter 
operator is a special case of the / operator 
character, which indicates trailing context. 
The expression 

ab/cd 

matches the string ab , but only if followed 
by cd. Thus 

ab$ 

is the same as 

ab/\n 

Left context is handled in Lex by start con¬ 
ditions as explained in section 10. If a rule 
is only to be executed when the Lex auto¬ 
maton interpreter is in start condition x , the 
rule should be prefixed by 
<x> 

using the angle bracket operator characters. 
If we considered “being at the beginning of 
a line” to be start condition ONE, then the 
^ operator would be equivalent to 
<ONE> 

Start conditions are explained more fully 
later. 

Repetitions and Definitions. The 
operators {} specify either repetitions (if 
they enclose numbers) or definition expan¬ 
sion (if they enclose a name). For example 
{digit} 

looks for a predefined string named digit 
and inserts it at that point in the expres¬ 
sion. The definitions are given in the first 
part of the Lex input, before the rules. In 
contrast, 

a{1,5} 

looks for 1 to 5 occurrences of a. 

Finally, initial % is special, being the 
separator for Lex source segments. 

4. Lex Actions. 

When an expression written as above 
is matched, Lex executes the corresponding 
action. This section describes some features 
of Lex which aid in writing actions. Note 
that there is a default action, which consists 
of copying the input to the output. This is 
performed on all strings not otherwise 
matched. Thus the Lex user who wishes to 
absorb the entire input, without producing 
any output, must provide rules to match 
everything. When Lex is being used with 
Yacc, this is the normal situation. One may 
consider that actions are what is done 
instead of copying the input to the output; 
thus, in general, a rule which merely copies 
can be omitted. Also, a character combina¬ 
tion which is omitted from the rules and 
which appears as input is likely to be 
printed on the output, thus calling attention 
to the gap in the rules. 

One of the simplest things that can be 
done is to ignore the input. Specifying a C 
null statement, ; as an action causes this 
result. A frequent rule is 
[ \t\n]     ; 

which causes the three spacing characters 
(blank, tab, and newline) to be ignored. 

Another easy way to avoid writing 
actions is the action character |, which indicates 
that the action for this rule is the 
action for the next rule. The previous 
example could also have been written 

" "     | 

"\t"    | 

"\n"    ; 

with the same result, although in different 
style. The quotes around \n and \t are not 
required. 

In more complex actions, the user will 
often want to know the actual text that 
matched some expression like [a-z]+. Lex 
leaves this text in an external character 
array named yytext. Thus, to print the 
name found, a rule like 

[a-z]+ printf("%s", yytext); 
will print the string in yytext. The C func¬ 
tion printf accepts a format argument and 
data to be printed; in this case, the format 
is “print string” (% indicating data conver¬ 
sion, and s indicating string type), and the 
data are the characters in yytext. So this 
just places the matched string on the out¬ 
put. This action is so common that it may 
be written as ECHO: 

[a-z]+ ECHO; 

is the same as the above. Since the default 
action is just to print the characters found, 
one might ask why give a rule, like this one, 
which merely specifies the default action? 
Such rules are often required to avoid 
matching some other rule which is not 
desired. For example, if there is a rule 
which matches read it will normally match 
the instances of read contained in bread or 
readjust; to avoid this, a rule of the form 
[a-z]+ is needed. This is explained 
further below. 

Sometimes it is more convenient to 
know the end of what has been found; hence 
Lex also provides a count yyleng of the 
number of characters matched. To count 
both the number of words and the number 
of characters in words in the input, the user 
might write 

[a-zA-Z]+   {words++; chars += yyleng;} 
which accumulates in chars the number of 
characters in the words recognized. The 
last character in the string matched can be 
accessed by 

yytext[yyleng-1] 

Occasionally, a Lex action may decide 
that a rule has not recognized the correct 
span of characters. Two routines are pro¬ 
vided to aid with this situation. First, 
yymore() can be called to indicate that the 
next input expression recognized is to be 
tacked on to the end of this input. Nor¬ 
mally, the next input string would overwrite 
the current entry in yytext. Second, yyless(n) 
may be called to indicate that not all the 
characters matched by the currently suc¬ 
cessful expression are wanted right now. 
The argument n indicates the number of 
characters in yytext to be retained. Further 
characters previously matched are returned 
to the input. This provides the same sort of 
lookahead offered by the / operator, but in a 
different form. 

Example: Consider a language which 
defines a string as a set of characters 
between quotation (") marks, and provides 
that to include a " in a string it must be 
preceded by a \. The regular expression 
which matches that is somewhat confusing, 
so that it might be preferable to write 

\"[^"]*   { 
          if (yytext[yyleng-1] == '\\') 
               yymore(); 
          else 
               ... normal user processing 
          } 

which will, when faced with a string such as 
"abc\"def" first match the five characters 
"abc\; then the call to yymore() will cause 
the next part of the string, "def, to be 
tacked on the end. Note that the final 
quote terminating the string should be 
picked up in the code labeled “normal processing”. 
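
The way yymore() extends the matched text can be imitated in plain C. The following is a minimal sketch of the idea only: it works on a string in memory instead of the Lex input, and the function name scan_quoted is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Emulate the rule  \"[^"]*  together with yymore().  Starting at an
 * opening quote, match "a quote followed by non-quote characters".
 * If that match ends in a backslash and a quote follows, the next
 * match is appended (the yymore() case); otherwise scanning stops.
 * Returns the accumulated length, i.e. what yyleng would be when the
 * normal processing starts.  The closing quote is not included. */
size_t scan_quoted(const char *in)
{
    size_t len = 0;

    assert(in != NULL && in[0] == '"');
    for (;;) {
        len++;                                  /* the quote itself */
        while (in[len] != '\0' && in[len] != '"')
            len++;                              /* [^"]*            */
        if (in[len] == '"' && in[len - 1] == '\\')
            continue;                           /* yymore(): extend */
        return len;
    }
}
```

For the input "abc\"def" the first pass stops after five characters, the backslash causes a second pass, and the result is 9, the length of "abc\"def without its closing quote.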

The function yyless() might be used to 
reprocess text in various circumstances. 
Consider the C problem of distinguishing 
the ambiguity of “=-a”. Suppose it is 
desired to treat this as “=- a” but print a 
message. A rule might be 
=-[a-zA-Z]  { 
     printf("Operator (=-) ambiguous\n"); 
     yyless(yyleng-1); 
     ... action for =- ... 
     } 

which prints a message, returns the letter 
after the operator to the input stream, and 
treats the operator as “=-”. Alternatively 
it might be desired to treat this as “= -a”. 
To do this, just return the minus sign 
as well as the letter to the input: 

=-[a-zA-Z]  { 
     printf("Operator (=-) ambiguous\n"); 
     yyless(yyleng-2); 
     ... action for = ... 
     } 

will perform the other interpretation. Note 
that the expressions for the two cases might 
more easily be written 

=-/[A-Za-z] 
in the first case and 

=/-[A-Za-z] 

in the second; no backup would be required 
in the rule action. It is not necessary to 
recognize the whole identifier to observe the 
ambiguity. The possibility of “=-3”, however, 
makes 

=-/[^ \t\n] 

a still better rule. 

In addition to these routines, Lex also 
permits access to the I/O routines it uses. 
They are: 

1) input() which returns the next input 
character; 

2) output(c) which writes the character c 
on the output; and 

3) unput(c) pushes the character c back 
onto the input stream to be read later 
by input(). 

By default these routines are provided as 
macro definitions, but the user can override 
them and supply private versions. These 
routines define the relationship between 
external files and internal characters, and 
must all be retained or modified con¬ 
sistently. They may be redefined, to cause 
input or output to be transmitted to or from 
strange places, including other programs or 
internal memory; but the character set used 
must be consistent in all routines; a value of 
zero returned by input must mean end of 
file; and the relationship between unput and 
input must be retained or the Lex look¬ 
ahead will not work. Lex does not look 
ahead at all if it does not have to, but every 
rule ending in + * ? or $ or containing / 
implies lookahead. Lookahead is also neces¬ 
sary to match an expression that is a prefix 
of another expression. See below for a dis¬ 
cussion of the character set used by Lex. 
The standard Lex library imposes a 100 
character limit on backup. 

Another Lex library routine that the 
user will sometimes want to redefine is 
yywrap() which is called whenever Lex 
reaches an end-of-file. If yywrap returns a 
1, Lex continues with the normal wrapup on 
end of input. Sometimes, however, it is 
convenient to arrange for more input to 
arrive from a new source. In this case, the 
user should provide a yywrap which 
arranges for new input and returns 0. This 
instructs Lex to continue processing. The 
default yywrap always returns 1. 

This routine is also a convenient place 
to print tables, summaries, etc. at the end of 
a program. Note that it is not possible to 
write a normal rule which recognizes end- 
of-file; the only access to this condition is 
through yywrap. In fact, unless a private 
version of input() is supplied a file contain¬ 
ing nulls cannot be handled, since a value of 
0 returned by input is taken to be end-of- 
file. 
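
A private yywrap that supplies more input might look like the sketch below. It is only an outline under stated assumptions: the inputs are strings in memory standing in for a series of files, and the names sources, next_char and count_all are hypothetical, introduced here to make the sketch self-contained. Only the return convention of yywrap (0 to continue, 1 to wrap up) comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory inputs standing in for a series of files. */
static const char *sources[] = { "one ", "two", NULL };
static size_t current = 0;      /* index of the source being read   */
static size_t offset = 0;       /* read position within that source */

/* Private input routine: next character, or 0 at end of this source. */
int next_char(void)
{
    char c;

    assert(sources[current] != NULL);
    c = sources[current][offset];
    if (c != '\0')
        offset++;
    return c;
}

/* Called at end of input: 0 means "more input is ready, keep going",
 * 1 means "perform the normal wrapup". */
int yywrap(void)
{
    if (sources[current + 1] == NULL)
        return 1;               /* nothing left */
    current++;
    offset = 0;
    return 0;
}

/* Driver that mimics the scanner's end-of-file handling by counting
 * every character delivered across all sources. */
size_t count_all(void)
{
    size_t n = 0;

    for (;;) {
        while (next_char() != 0)
            n++;
        if (yywrap())
            return n;
    }
}
```

Because a zero from the input routine means end of file, this sketch, like the standard library version, cannot handle input that contains nulls.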

5. Ambiguous Source Rules. 

Lex can handle ambiguous 
specifications. When more than one expres¬ 
sion can match the current input, Lex 
chooses as follows: 

1) The longest match is preferred. 

2) Among rules which matched the same 
number of characters, the rule given 
first is preferred. 

Thus, suppose the rules 

integer keyword action ...; 

[a-z]+ identifier action ...; 
to be given in that order. If the input is 
integers, it is taken as an identifier, because 
[a-z]+ matches 8 characters while integer 
matches only 7. If the input is integer, both 
rules match 7 characters, and the keyword 
rule is selected because it was given first. 
Anything shorter (e.g. int) will not match 
the expression integer and so the identifier 
interpretation is used. 
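
The two disambiguation principles can be written out directly in C. This is a sketch for these two rules only, not the table-driven method Lex uses; the names choose_rule, match_keyword and match_identifier are assumptions made for this sketch, and the range test on letters assumes a character set such as ASCII in which a through z are contiguous.

```c
#include <assert.h>
#include <string.h>

enum { NO_MATCH, KEYWORD, IDENTIFIER };

/* Length of the prefix of s matched by the literal "integer". */
static size_t match_keyword(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Length of the longest prefix of s matched by [a-z]+. */
static size_t match_identifier(const char *s)
{
    size_t n = 0;

    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Longest match wins; on a tie the rule given first (the keyword)
 * wins. */
int choose_rule(const char *s)
{
    size_t k, id;

    assert(s != NULL);
    k = match_keyword(s);
    id = match_identifier(s);
    if (k == 0 && id == 0)
        return NO_MATCH;
    return (k >= id) ? KEYWORD : IDENTIFIER;
}
```

With these definitions, integers and int are classified as identifiers and integer as the keyword, as described above.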

The principle of preferring the longest 
match makes rules containing expressions 
like .* dangerous. For example, 

'.*' 

might seem a good way of recognizing a 
string in single quotes. But it is an invita¬ 
tion for the program to read far ahead, 
looking for a distant single quote. 

Presented with the input 
'first' quoted string here, 'second' here 
the above expression will match 

'first' quoted string here, 'second' 
which is probably not what was wanted. A 
better rule is of the form 
'[^'\n]*' 

which, on the above input, will stop after 
'first'. The consequences of errors like this 
are mitigated by the fact that the . operator 
will not match newline. Thus expressions 
like .* stop on the current line. Don’t try to 
defeat this with expressions like [.\n]+ or 
equivalents; the Lex generated program will 
try to read the entire input file, causing 
internal buffer overflows. 

Note that Lex is normally partitioning 
the input stream, not searching for all possi¬ 
ble matches of each expression. This means 
that each character is accounted for once 
and only once. For example, suppose it is 
desired to count occurrences of both she 
and he in an input text. Some Lex rules to 
do this might be 

she     s++; 

he      h++; 

\n      | 

.       ; 

where the last two rules ignore everything 
besides he and she. Remember that . does 
not include newline. Since she includes he, 
Lex will normally not recognize the 
instances of he included in she , since once it 
has passed a she those characters are gone. 

Sometimes the user would like to over¬ 
ride this choice. The action REJECT 
means “go do the next alternative.” It 
causes whatever rule was second choice 
after the current rule to be executed. The 
position of the input pointer is adjusted 
accordingly. Suppose the user really wants 
to count the included instances of he: 
she     {s++; REJECT;} 

he      {h++; REJECT;} 

\n      | 

.       ; 

these rules are one way of changing the previous 
example to do just that. After counting 
each expression, it is rejected; whenever 
appropriate, the other expression will then 
be counted. In this example, of course, the 
user could note that she includes he but not 
vice versa, and omit the REJECT action on 
he; in other cases, however, it would not be 
possible a priori to tell which input charac¬ 
ters were in both classes. 
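
The difference between partitioning the input and using REJECT can be seen in a small C sketch. It models only these four rules: after a rejected she or he, the next alternative at the same position is the one-character rule, so the scan advances by one character instead of the full match. The names struct counts and count_pronouns are assumptions made for this sketch.

```c
#include <assert.h>
#include <string.h>

struct counts { int she; int he; };

/* Count "she" and "he" in text.
 * use_reject == 0: partition the input; each character belongs to
 *                  one match, so the he inside she is not seen.
 * use_reject != 0: after counting, fall back to the one-character
 *                  rule at the same position, as REJECT would. */
struct counts count_pronouns(const char *text, int use_reject)
{
    struct counts c = { 0, 0 };
    size_t i = 0, n;

    assert(text != NULL);
    n = strlen(text);
    while (i < n) {
        if (strncmp(text + i, "she", 3) == 0) {
            c.she++;
            i += use_reject ? 1 : 3;
        } else if (strncmp(text + i, "he", 2) == 0) {
            c.he++;
            i += use_reject ? 1 : 2;
        } else {
            i++;                /* the . and \n rules: ignore */
        }
    }
    return c;
}
```

On the text "she said he" the partitioning version finds one she and one he, while the REJECT version finds one she and two instances of he.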

Consider the two rules 

a[bc]+   { ... ; REJECT;} 
a[cd]+   { ... ; REJECT;} 

If the input is ab , only the first rule 
matches, and on ad only the second 
matches. The input string accb matches the 
first rule for four characters and then the 
second rule for three characters. In con¬ 
trast, the input accd agrees with the second 
rule for four characters and then the first 
rule for three. 

In general, REJECT is useful when¬ 
ever the purpose of Lex is not to partition 
the input stream but to detect all examples 
of some items in the input, and the 
instances of these items may overlap or 
include each other. Suppose a digram table 
of the input is desired; normally the digrams 
overlap, that is the word the is considered 
to contain both th and he. Assuming a 
two-dimensional array named digram to be 
incremented, the appropriate source is 
%% 

[a-z][a-z]  { 
            digram[yytext[0]][yytext[1]]++; 
            REJECT; 
            } 

.           ; 

\n          ; 

where the REJECT is necessary to pick up 
a letter pair beginning at every character, 
rather than at every other character. 
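
The same table can be built by a hand-written loop, which makes the role of REJECT visible: the loop advances one character at a time rather than two. This is a sketch under assumptions: the table is indexed by letter position (0 for a through 25 for z) rather than by raw character value as in the Lex source, the letters are assumed contiguous as in ASCII, and the function name count_digrams is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Table indexed by letter position: 0 for 'a' through 25 for 'z'. */
static int digram[26][26];

static int is_lower(char c)
{
    return c >= 'a' && c <= 'z';
}

/* Count every overlapping pair of lower case letters in text. */
void count_digrams(const char *text)
{
    size_t i;

    assert(text != NULL);
    for (i = 0; text[i] != '\0' && text[i + 1] != '\0'; i++) {
        if (is_lower(text[i]) && is_lower(text[i + 1]))
            digram[text[i] - 'a'][text[i + 1] - 'a']++;
        /* stepping by one character, not two, plays the role
         * of REJECT */
    }
}
```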

6. Lex Source Definitions. 

Remember the format of the Lex 
source: 

{definitions} 

%% 

{rules} 

%% 

{user routines} 

So far only the rules have been described. 
The user needs additional options, though, 
to define variables for use in his program 
and for use by Lex. These can go either in 
the definitions section or in the rules sec¬ 
tion. 







Remember that Lex is turning the 
rules into a program. Any source not inter¬ 
cepted by Lex is copied into the generated 
program. There are three classes of such 
things. 

1) Any line which is not part of a Lex 
rule or action which begins with a 
blank or tab is copied into the Lex 
generated program. Such source input 
prior to the first %% delimiter will be 
external to any function in the code; if 
it appears immediately after the first 
%%, it appears in an appropriate place 
for declarations in the function written 
by Lex which contains the actions. 
This material must look like program 
fragments, and should precede the first 
Lex rule. 

As a side effect of the above, lines 
which begin with a blank or tab, and 
which contain a comment, are passed 
through to the generated program. 
This can be used to include comments 
in either the Lex source or the gen¬ 
erated code. The comments should 
follow the host language convention. 

2) Anything included between lines con¬ 
taining only %{ and %} is copied out as 
above. The delimiters are discarded. 
This format permits entering text like 
preprocessor statements that must 
begin in column 1, or copying lines 
that do not look like programs. 

3) Anything after the second %% delimiter, 
regardless of formats, etc., is copied 
out after the Lex output. 

Definitions intended for Lex are given 
before the first %% delimiter. Any line in 
this section not contained between %{ and 
%}, and beginning in column 1, is assumed to 
define Lex substitution strings. The format 
of such lines is 

name translation 

and it causes the string given as a transla¬ 
tion to be associated with the name. The 
name and translation must be separated by 
at least one blank or tab, and the name 
must begin with a letter. The translation 
can then be called out by the {name} syntax 
in a rule. Using {D} for the digits and {E} 
for an exponent field, for example, might 
abbreviate rules to recognize numbers: 

D                     [0-9] 
E                     [DEde][-+]?{D}+ 

%% 

{D}+                  printf("integer"); 

{D}+"."{D}*({E})?     | 
{D}*"."{D}+({E})?     | 
{D}+{E} 

Note the first two rules for real numbers; 
both require a decimal point and contain an 
optional exponent field, but the first 
requires at least one digit before the decimal 
point and the second requires at least one 
digit after the decimal point. To correctly 
handle the problem posed by a Fortran 
expression such as 35.EQ.I, which does not 
contain a real number, a context-sensitive 
rule such as 

[0-9]+/"."EQ    printf("integer"); 

could be used in addition to the normal rule 
for integers. 
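
The way definitions expand into rules can be imitated with C string concatenation. The sketch below assumes the POSIX <regex.h> routines are available and uses them in place of Lex; the macro names DIGIT and EXPONENT stand for D and E, and classify_number, full_match and the NUM_ constants are hypothetical names introduced for this sketch. The ^ and $ anchors are added only so that whole strings are tested.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* The definitions D and E, spelled as string fragments. */
#define DIGIT    "[0-9]"
#define EXPONENT "[DEde][-+]?" DIGIT "+"

/* The rules, built by substituting the definitions. */
#define INTEGER_RE "^" DIGIT "+$"
#define REAL_RE    "^(" DIGIT "+\\." DIGIT "*(" EXPONENT ")?" \
                   "|"  DIGIT "*\\." DIGIT "+(" EXPONENT ")?" \
                   "|"  DIGIT "+" EXPONENT ")$"

enum { NUM_OTHER, NUM_INTEGER, NUM_REAL };

/* Return 1 if pattern (POSIX extended syntax) matches all of s. */
static int full_match(const char *pattern, const char *s)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    assert(rc == 0);            /* patterns are fixed constants */
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Classify s the way the rules above would. */
int classify_number(const char *s)
{
    assert(s != NULL);
    if (full_match(INTEGER_RE, s))
        return NUM_INTEGER;
    if (full_match(REAL_RE, s))
        return NUM_REAL;
    return NUM_OTHER;
}
```

A lone decimal point is rejected because each of the two real-number forms demands a digit on one side of it.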

The definitions section may also con¬ 
tain other commands, including the selec¬ 
tion of a host language, a character set 
table, a list of start conditions, or adjust¬ 
ments to the default size of arrays within 
Lex itself for larger source programs. These 
possibilities are discussed below under 
“Summary of Source Format,” section 12. 

7. Usage. 

There are two steps in compiling a 
Lex source program. First, the Lex source 
must be turned into a generated program in 
the host general purpose language. Then 
this program must be compiled and loaded, 
usually with a library of Lex subroutines. 
The generated program is on a file named 
lex.yy.c. The I/O library is defined in terms 
of the C standard library [6]. 

The C programs generated by Lex are 
slightly different on OS/370, because the OS 
compiler is less powerful than the UNIX or 
GCOS compilers, and does less at compile 
time. C programs generated on GCOS and 
UNIX are the same. 

UNIX. The library is accessed by the 
loader flag -ll. So an appropriate set of 
commands is 

lex source 
cc lex.yy.c -ll 
The resulting program is placed on the 
usual file a.out for later execution. To use 
Lex with Yacc see below. Although the 
default Lex I/O routines use the C standard 
library, the Lex automata themselves do not 
do so; if private versions of input, output 
and unput are given, the library can be 
avoided. 

8. Lex and Yacc. 

If you want to use Lex with Yacc, note 
that what Lex writes is a program named 
yylex(), the name required by Yacc for its 
analyzer. Normally, the default main pro¬ 
gram on the Lex library calls this routine, 
but if Yacc is loaded, and its main program 
is used, Yacc will call yylex(). In this case 
each Lex rule should end with 
return(token); 

where the appropriate token value is 
returned. An easy way to get access to 
Yacc’s names for tokens is to compile the 
Lex output file as part of the Yacc output 
file by placing the line 

# include "lex.yy.c" 

in the last section of Yacc input. Supposing 
the grammar to be named “good” and the 
lexical rules to be named “better” the UNIX 
command sequence can just be: 
yacc good 
lex better 
cc y.tab.c -ly -ll 

The Yacc library (-ly) should be loaded 
before the Lex library, to obtain a main pro¬ 
gram which invokes the Yacc parser. The 
generations of Lex and Yacc programs can 
be done in either order. 

9. Examples. 

As a trivial problem, consider copying 
an input file while adding 3 to every positive 
number divisible by 7. Here is a suitable 
Lex source program 

%% 
        int k; 
[0-9]+  { 
        k = atoi(yytext); 
        if (k%7 == 0) 
             printf("%d", k+3); 
        else 
             printf("%d",k); 
        } 

to do just that. The rule [0-9]+ recognizes 
strings of digits; atoi converts the digits to 
binary and stores the result in k. The 
operator % (remainder) is used to check 
whether k is divisible by 7; if it is, it is 
incremented by 3 as it is written out. It 
may be objected that this program will alter 
such input items as 49.63 or X7. Furthermore, 
it increments the absolute value of all 
negative numbers divisible by 7. To avoid 
this, just add a few more rules after the 
active one, as here: 

%% 
                      int k; 
-?[0-9]+              { 
                      k = atoi(yytext); 
                      printf("%d", 
                        k%7 == 0 ? k+3 : k); 
                      } 
-?[0-9.]+             ECHO; 
[A-Za-z][A-Za-z0-9]+  ECHO; 

Numerical strings containing a “.” or preceded 
by a letter will be picked up by one of 
the last two rules, and not changed. The 
if-else has been replaced by a C conditional 
expression to save space; the form a?b:c 
means “if a then b else c”. 

For an example of statistics gathering, 
here is a program which histograms the 
lengths of words, where a word is defined as 
a string of letters. 

        int lengs[100]; 
%% 
[a-z]+  lengs[yyleng]++; 
.       | 
\n      ; 
%% 
yywrap() 
{ 
int i; 
printf("Length  No. words\n"); 
for(i=0; i<100; i++) 
     if (lengs[i] > 0) 
          printf("%5d%10d\n",i,lengs[i]); 
return(1); 
} 

This program accumulates the histogram, 
while producing no output. At the end of 
the input it prints the table. The final 
statement return(1); indicates that Lex is to 
perform wrapup. If yywrap returns zero 
(false) it implies that further input is avail¬ 
able and the program is to continue reading 
and processing. To provide a yywrap that 
never returns true causes an infinite loop. 

As a larger example, here are some 
parts of a program written by N. L. Schryer 
to convert double precision Fortran to single 
precision Fortran. Because Fortran does 
not distinguish upper and lower case letters, 
this routine begins by defining a set of 
classes including both cases of each letter: 
a     [aA] 
b     [bB] 
c     [cC] 
... 
z     [zZ] 

An additional class recognizes white space: 

W [ \t]* 

The first rule changes “double precision” to 
“real”, or “DOUBLE PRECISION” to 
“REAL” 

{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n}  { 
     printf(yytext[0]=='d'? "real" : "REAL"); 
     } 

Care is taken throughout this program to
preserve the case (upper or lower) of the
original program. The conditional operator
is used to select the proper form of the
keyword. The next rule copies continuation
card indications to avoid confusing them
with constants:

^"     "[^ 0]   ECHO;

In the regular expression, the quotes
surround the blanks. It is interpreted as
“beginning of line, then five blanks, then
anything but blank or zero.” Note the two
different meanings of ^. There follow some
rules to change double precision constants
to ordinary floating constants.

[0-9]+{W}{d}{W}[+-]?{W}[0-9]+          |
[0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+    |
"."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+    {
     /* convert constants */
     for(p=yytext; *p != 0; p++)
          {
          if (*p == 'd' || *p == 'D')
               *p += 'e' - 'd';
          }
     ECHO;
     }

After the floating point constant is
recognized, it is scanned by the for loop to find
the letter d or D. The program then adds
'e'-'d', which converts it to the next letter
of the alphabet. The modified constant, 
now single-precision, is written out again. 
There follow a series of names which must 
be respelled to remove their initial d. By 
using the array yytext the same action 
suffices for all the names (only a sample of a 
rather long list is given here). 

{d}{s}{i}{n}           |
{d}{c}{o}{s}           |
{d}{s}{q}{r}{t}        |
{d}{a}{t}{a}{n}        |
...
{d}{f}{l}{o}{a}{t}     printf("%s",yytext+1);

Another list of names must have initial d
changed to initial a:

{d}{l}{o}{g}      |
{d}{l}{o}{g}10    |
{d}{m}{i}{n}1     |
{d}{m}{a}{x}1     {
          yytext[0] += 'a' - 'd';
          ECHO;
          }

And one routine must have initial d
changed to initial r:

{d}1{m}{a}{c}{h}  {yytext[0] += 'r' - 'd';
                  ECHO;
                  }

To avoid such names as dsinx being 
detected as instances of dsin, some final
rules pick up longer words as identifiers and
copy some surviving characters:

[A-Za-z][A-Za-z0-9]*     |
[0-9]+                   |
\n                       |
.                        ECHO;

Note that this program is not complete; it 
does not deal with the spacing problems in 
Fortran or with the use of keywords as 
identifiers. 
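
The character-level rewrites performed by the actions above can be tried
outside Lex. The following is a minimal sketch in C, not part of Schryer's
program: the function names exponent_to_single, change_initial, and
drop_initial are hypothetical, and the arithmetic assumes a character set
(such as ASCII) in which the letters of each case are contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Replace every exponent letter d/D with e/E, in place.
   Adding 'e' - 'd' moves the letter one place up the alphabet,
   so the case of the original letter is preserved. */
void exponent_to_single(char *s)
{
    assert(s != NULL);
    for (char *p = s; *p != '\0'; p++)
        if (*p == 'd' || *p == 'D')
            *p += 'e' - 'd';
}

/* Change a leading d/D to another letter of the same case,
   e.g. dlog -> alog when to_lower is 'a'. */
void change_initial(char *name, char to_lower)
{
    assert(name != NULL && (name[0] == 'd' || name[0] == 'D'));
    name[0] += to_lower - 'd';
}

/* Drop a leading d/D, e.g. dsin -> sin.
   Returns a pointer into name, as yytext+1 does in the rule. */
const char *drop_initial(const char *name)
{
    assert(name != NULL && (name[0] == 'd' || name[0] == 'D'));
    return name + 1;
}
```

Because each function only adds a fixed offset or skips one character, the
same action serves every name in a list, which is why the rules above can
share a single action through the | operator.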

10. Left Context Sensitivity. 

Sometimes it is desirable to have 
several sets of lexical rules to be applied at 
different times in the input. For example, a 
compiler preprocessor might distinguish 
preprocessor statements and analyze them 
differently from ordinary statements. This 
requires sensitivity to prior context, and 
there are several ways of handling such 
problems. The ^ operator, for example, is a
prior context operator, recognizing
immediately preceding left context just as $
recognizes immediately following right context.
Adjacent left context could be extended, to
produce a facility similar to that for
adjacent right context, but it is unlikely to be as
useful, since often the relevant left context 
appeared some time earlier, such as at the 
beginning of a line. 

This section describes three means of
dealing with different environments: a
simple use of flags, when only a few rules
change from one environment to another,
the use of start conditions on rules, and the
possibility of making multiple lexical
analyzers all run together. In each case,
there are rules which recognize the need to
change the environment in which the
following input text is analyzed, and set some
parameter to reflect the change. This may
be a flag explicitly tested by the user’s
action code; such a flag is the simplest way
of dealing with the problem, since Lex is
not involved at all. It may be more
convenient, however, to have Lex remember the
flags as initial conditions on the rules. Any
rule may be associated with a start
condition. It will only be recognized when Lex is
in that start condition. The current start
condition may be changed at any time.
Finally, if the sets of rules for the different
environments are very dissimilar, clarity
may be best achieved by writing several
distinct lexical analyzers, and switching from
one to another as desired.

Consider the following problem: copy 
the input to the output, changing the word 
magic to first on every line which began 
with the letter a, changing magic to second 
on every line which began with the letter b,
and changing magic to third on every line 
which began with the letter c. All other 
words and all other lines are left unchanged. 

These rules are so simple that the 
easiest way to do this job is with a flag:

        int flag;
%%
^a      {flag = 'a'; ECHO;}
^b      {flag = 'b'; ECHO;}
^c      {flag = 'c'; ECHO;}
\n      {flag =  0 ; ECHO;}
magic   {
        switch (flag)
        {
        case 'a': printf("first"); break;
        case 'b': printf("second"); break;
        case 'c': printf("third"); break;
        default: ECHO; break;
        }
        }

should be adequate.

To handle the same problem with
start conditions, each start condition must
be introduced to Lex in the definitions
section with a line reading

%Start name1 name2 ...

where the conditions may be named in any
order. The word Start may be abbreviated
to s or S. The conditions may be
referenced at the head of a rule with the <>
brackets:

<name1>expression

is a rule which is only recognized when Lex
is in the start condition name1. To enter a
start condition, execute the action
statement

BEGIN name1;

which changes the start condition to name1.
To resume the normal state,

BEGIN 0;

resets the initial condition of the Lex
automaton interpreter. A rule may be active in
several start conditions:

<name1,name2,name3>

is a legal prefix. Any rule not beginning
with the <> prefix operator is always active.

The same example as before can be
written:

%START AA BB CC
%%
^a              {ECHO; BEGIN AA;}
^b              {ECHO; BEGIN BB;}
^c              {ECHO; BEGIN CC;}
\n              {ECHO; BEGIN 0;}
<AA>magic       printf("first");
<BB>magic       printf("second");
<CC>magic       printf("third");

where the logic is exactly the same as in the
previous method of handling the problem,
but Lex does the work rather than the
user’s code.

11. Character Set.

The programs generated by Lex
handle character I/O only through the routines
input, output, and unput. Thus the
character representation provided in these routines
is accepted by Lex and employed to return
values in yytext. For internal use a
character is represented as a small integer which,
if the standard library is used, has a value
equal to the integer value of the bit pattern
representing the character on the host
computer. Normally, the letter a is represented
as the same form as the character constant
'a'. If this interpretation is changed, by
providing I/O routines which translate the
characters, Lex must be told about it, by
giving a translation table. This table must
be in the definitions section, and must be
bracketed by lines containing only “%T”.
The table contains lines of the form

{integer} {character string}

which indicate the value associated with
each character. Thus the next example

%T
 1      Aa
 2      Bb
...
26      Zz
27      \n
28      +
29      -
30      0
31      1
...
39      9
%T

Sample character table.

maps the lower and upper case letters
together into the integers 1 through 26,
newline into 27, + and - into 28 and 29, and
the digits into 30 through 39. Note the
escape for newline. If a table is supplied,
every character that is to appear either in
the rules or in any valid input must be
included in the table. No character may be
assigned the number 0, and no character
may be assigned a bigger number than the
size of the hardware character set.

12. Summary of Source Format.

The general form of a Lex source file
is:

{definitions}
%%
{rules}
%%
{user subroutines}

The definitions section contains a
combination of

1) Definitions, in the form “name space
   translation”.

2) Included code, in the form “space
   code”.

3) Included code, in the form

   %{
   code
   %}

4) Start conditions, given in the form

   %S name1 name2 ...

5) Character set tables, in the form

   %T
   number space character-string
   %T

6) Changes to internal array sizes, in the
   form

   %x nnn

   where nnn is a decimal integer
   representing an array size and x
   selects the parameter as follows:

   Letter   Parameter
   p        positions
   n        states
   e        tree nodes
   a        transitions
   k        packed character classes
   o        output array size

Lines in the rules section have the form
“expression action” where the action may
be continued on succeeding lines by using
braces to delimit it.


Regular expressions in Lex use the
following operators:

x        the character "x"
"x"      an "x", even if x is an operator.
\x       an "x", even if x is an operator.
[xy]     the character x or y.
[x-z]    the characters x, y or z.
[^x]     any character but x.
.        any character but newline.
^x       an x at the beginning of a line.
<y>x     an x when Lex is in start condition y.
x$       an x at the end of a line.
x?       an optional x.
x*       0,1,2, ... instances of x.
x+       1,2,3, ... instances of x.
x|y      an x or a y.
(x)      an x.
x/y      an x but only if followed by y.
{xx}     the translation of xx from the
         definitions section.
x{m,n}   m through n occurrences of x

13. Caveats and Bugs. 


There are pathological expressions 
which produce exponential growth of the 
tables when converted to deterministic 
machines; fortunately, they are rare. 

REJECT does not rescan the input;
instead it remembers the results of the
previous scan. This means that if a rule with
trailing context is found, and REJECT
executed, the user must not have used unput to
change the characters forthcoming from the
input stream. This is the only restriction
on the user’s ability to manipulate the
not-yet-processed input.


UNIX Implementation

K. Thompson 


ABSTRACT 

This paper describes in high-level terms the implementation of the
resident UNIX† kernel. This discussion is broken into three parts. The first
part describes how the UNIX system views processes, users, and programs.
The second part describes the I/O system. The last part describes the UNIX
file system.

1. INTRODUCTION 

The UNIX kernel consists of about 10,000 lines of C code and about 1,000 lines of assembly
code. The assembly code can be further broken down into 200 lines included for the sake
of efficiency (they could have been written in C) and 800 lines to perform hardware functions 
not possible in C. 

This code represents 5 to 10 percent of what has been lumped into the broad expression 
“the UNIX operating system.” The kernel is the only UNIX code that cannot be substituted by 
a user to his own liking. For this reason, the kernel should make as few real decisions as possible.
This does not mean to allow the user a million options to do the same thing. Rather, it
means to allow only one way to do one thing, but have that way be the least-common divisor of 
all the options that might have been provided. 

What is or is not implemented in the kernel represents both a great responsibility and a 
great power. It is a soap-box platform on “the way things should be done.” Even so, if “the 
way” is too radical, no one will follow it. Every important decision was weighed carefully. 
Throughout, simplicity has been substituted for efficiency. Complex algorithms are used only 
if their complexity can be localized. 

2. PROCESS CONTROL 

In the UNIX system, a user executes programs in an environment called a user process. 
When a system function is required, the user process calls the system as a subroutine. At some 
point in this call, there is a distinct switch of environments. After this, the process is said to 
be a system process. In the normal definition of processes, the user and system processes are 
different phases of the same process (they never execute simultaneously). For protection, each 
system process has its own stack. 

The user process may execute from a read-only text segment, which is shared by all 
processes executing the same code. There is no functional benefit from shared-text segments. 
An efficiency benefit comes from the fact that there is no need to swap read-only segments out 
because the original copy on secondary memory is still current. This is a great benefit to 
interactive programs that tend to be swapped while waiting for terminal input. Furthermore, if 
two processes are executing simultaneously from the same copy of a read-only segment, only 
one copy needs to reside in primary memory. This is a secondary effect, because simultaneous 
execution of a program is not common. It is ironic that this effect, which reduces the use of
primary memory, only comes into play when there is an overabundance of primary memory,
that is, when there is enough memory to keep waiting processes loaded.

† UNIX is a trademark of Bell Laboratories.

All current read-only text segments in the system are maintained from the text table. A 
text table entry holds the location of the text segment on secondary memory. If the segment is 
loaded, that table also holds the primary memory location and the count of the number of 
processes sharing this entry. When this count is reduced to zero, the entry is freed along with 
any primary and secondary memory holding the segment. When a process first executes a 
shared-text segment, a text table entry is allocated and the segment is loaded onto secondary 
memory. If a second process executes a text segment that is already allocated, the entry reference
count is simply incremented.
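
The reference counting described above can be sketched in C as follows. This
is an illustration only, not the kernel's code: the structure layout, the
table size NTEXT, the file_id field, and the names text_attach and text_detach
are all assumptions, and the allocation of primary and secondary memory is
reduced to comments.

```c
#include <assert.h>
#include <stddef.h>

#define NTEXT 8                 /* hypothetical size of the text table */

struct text {
    int  in_use;                /* is this entry allocated? */
    int  file_id;               /* hypothetical identifier of the program file */
    long swap_addr;             /* location of the segment on secondary memory */
    long core_addr;             /* primary memory location, 0 if not loaded */
    int  count;                 /* number of processes sharing this entry */
};

static struct text text_table[NTEXT];

/* Attach a process to the shared text of file_id.
   An existing entry just has its count incremented; otherwise a
   free entry is claimed. Returns NULL if the table is full. */
struct text *text_attach(int file_id)
{
    struct text *free_slot = NULL;

    for (int i = 0; i < NTEXT; i++) {
        struct text *t = &text_table[i];
        if (t->in_use && t->file_id == file_id) {
            t->count++;
            return t;
        }
        if (!t->in_use && free_slot == NULL)
            free_slot = t;
    }
    if (free_slot == NULL)
        return NULL;
    free_slot->in_use = 1;
    free_slot->file_id = file_id;
    free_slot->swap_addr = 0;   /* a real system would load the segment onto secondary memory here */
    free_slot->core_addr = 0;
    free_slot->count = 1;
    return free_slot;
}

/* Detach one process; the entry is freed when the count reaches zero. */
void text_detach(struct text *t)
{
    assert(t != NULL && t->in_use && t->count > 0);
    if (--t->count == 0)
        t->in_use = 0;          /* primary and secondary memory would be released here */
}
```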

A user process has some strictly private read-write data contained in its data segment. 
As far as possible, the system does not use the user’s data segment to hold system data. In 
particular, there are no I/O buffers in the user address space. 

The user data segment has two growing boundaries. One, increased automatically by the 
system as a result of memory faults, is used for a stack. The second boundary is only grown 
(or shrunk) by explicit requests. The contents of newly allocated primary memory is initialized 
to zero. 

Also associated and swapped with a process is a small fixed-size system data segment. 
This segment contains all the data about the process that the system needs only when the process
is active. Examples of the kind of data contained in the system data segment are: saved
central processor registers, open file descriptors, accounting information, scratch data area, and 
the stack for the system phase of the process. The system data segment is not addressable 
from the user process and is therefore protected. 

Last, there is a process table with one entry per process. This entry contains all the data 
needed by the system when the process is not active. Examples are the process’s name, the 
location of the other segments, and scheduling information. The process table entry is allocated
when the process is created, and freed when the process terminates. This process entry
is always directly addressable by the kernel. 

Figure 1 shows the relationships between the various process control data. In a sense, 
the process table is the definition of all processes, because all the data associated with a process
may be accessed starting from the process table entry.



Fig. 1—Process control data structure. 


2.1. Process creation and program execution 

Processes are created by the system primitive fork. The newly created process (child) is 
a copy of the original process (parent). There is no detectable sharing of primary memory 
between the two processes. (Of course, if the parent process was executing from a read-only
text segment, the child will share the text segment.) Copies of all writable data segments are
made for the child process. Files that were open before the fork are truly shared after the 
fork. The processes are informed as to their part in the relationship to allow them to select 
their own (usually non-identical) destiny. The parent may wait for the termination of any of 
its children. 

A process may exec a file. This consists of exchanging the current text and data segments
of the process for new text and data segments specified in the file. The old segments
are lost. Doing an exec does not change processes; the process that did the exec persists, but 
after the exec it is executing a different program. Files that were open before the exec remain 
open after the exec. 

If a program, say the first pass of a compiler, wishes to overlay itself with another program,
say the second pass, then it simply execs the second program. This is analogous to a
“goto.” If a program wishes to regain control after execing a second program, it should fork a 
child process, have the child exec the second program, and have the parent wait for the child. 
This is analogous to a “call.” Breaking up the call into a binding followed by a transfer is similar
to the subroutine linkage in SL-5.¹
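
The “call” pattern can be sketched with the present-day POSIX interfaces
fork, execvp, and waitpid, which descend from the primitives described here
but are not necessarily identical to them. The function name call_program is
hypothetical, and the exit code 127 for a failed exec is a shell convention
adopted here as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* "Call" another program: fork a child, have the child exec the
   program, and have the parent wait for the child.
   Returns the child's exit status, or -1 on failure or abnormal exit. */
int call_program(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;                  /* fork failed */

    if (pid == 0) {
        /* Child: replace this program with the new one (the "goto"). */
        execvp(argv[0], argv);
        _exit(127);                 /* reached only if the exec failed */
    }

    /* Parent: regain control when the child terminates. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Omitting the fork and calling execvp directly gives the “goto” form: the
calling program is overlaid and never regains control.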

2.2. Swapping 

The major data associated with a process (the user data segment, the system data segment,
and the text segment) are swapped to and from secondary memory, as needed. The user
data segment and the system data segment are kept in contiguous primary memory to reduce 
swapping latency. (When low-latency devices, such as bubbles, CCDs, or scatter/gather devices, 
are used, this decision will have to be reconsidered.) Allocation of both primary and secondary 
memory is performed by the same simple first-fit algorithm. When a process grows, a new 
piece of primary memory is allocated. The contents of the old memory is copied to the new 
memory. The old memory is freed and the tables are updated. If there is not enough primary 
memory, secondary memory is allocated instead. The process is swapped out onto the secondary
memory, ready to be swapped in with its new size.
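
A first-fit allocator of the kind mentioned above might look like the
following sketch. It is an assumption about the general technique, not a
transcription of the kernel's routine: the free list is taken to be an array
of (size, address) pairs terminated by an entry of size zero, address 0 is
reserved to signal failure, and the names struct map and ff_alloc are
hypothetical. Releasing memory, which would merge adjacent free regions, is
omitted.

```c
#include <assert.h>
#include <stddef.h>

struct map {
    unsigned size;      /* length of a free region; 0 terminates the list */
    unsigned addr;      /* starting address of the free region */
};

/* Allocate size units from the first free region that is large enough.
   Returns the starting address, or 0 if no region fits. */
unsigned ff_alloc(struct map *mp, unsigned size)
{
    assert(mp != NULL && size > 0);

    for (struct map *bp = mp; bp->size != 0; bp++) {
        if (bp->size >= size) {
            unsigned a = bp->addr;
            bp->addr += size;
            bp->size -= size;
            if (bp->size == 0) {
                /* Region used up: slide the later entries down,
                   stopping once the terminator has been copied. */
                do {
                    bp[0] = bp[1];
                    bp++;
                } while (bp[-1].size != 0);
            }
            return a;
        }
    }
    return 0;
}
```

Because the same routine only needs a different free list, one such function
can serve both primary and secondary memory, as the text describes.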

One separate process in the kernel, the swapping process, simply swaps the other 
processes in and out of primary memory. It examines the process table looking for a process 
that is swapped out and is ready to run. It allocates primary memory for that process and 
reads its segments into primary memory, where that process competes for the central processor 
with other loaded processes. If no primary memory is available, the swapping process makes 
memory available by examining the process table for processes that can be swapped out. It 
selects a process to swap out, writes it to secondary memory, frees the primary memory, and 
then goes back to look for a process to swap in. 

Thus there are two specific algorithms to the swapping process. Which of the possibly 
many processes that are swapped out is to be swapped in? This is decided by secondary 
storage residence time. The one with the longest time out is swapped in first. There is a slight 
penalty for larger processes. Which of the possibly many processes that are loaded is to be 
swapped out? Processes that are waiting for slow events (i.e., not currently running or waiting 
for disk I/O) are picked first, by age in primary memory, again with size penalties. The other 
processes are examined by the same age algorithm, but are not taken out unless they are at 
least of some age. This adds hysteresis to the swapping and prevents total thrashing. 

These swapping algorithms are the most suspect in the system. With limited primary 
memory, these algorithms cause total swapping. This is not bad in itself, because the swapping 
does not impact the execution of the resident processes. However, if the swapping device must 
also be used for file storage, the swapping traffic severely impacts the file system traffic. It is 
exactly these small systems that tend to double usage of limited disk resources. 





2.3. Synchronization and scheduling 

Process synchronization is accomplished by having processes wait for events. Events are 
represented by arbitrary integers. By convention, events are chosen to be addresses of tables 
associated with those events. For example, a process that is waiting for any of its children to 
terminate will wait for an event that is the address of its own process table entry. When a process
terminates, it signals the event represented by its parent’s process table entry. Signaling
an event on which no process is waiting has no effect. Similarly, signaling an event on which
many processes are waiting will wake all of them up. This differs considerably from Dijkstra’s
P and V synchronization operations,² in that no memory is associated with events. Thus there
need be no allocation of events prior to their use. Events exist simply by being used. 
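
The memoryless event mechanism can be modeled in a few lines of C. This is a
sketch under stated assumptions rather than the kernel's code: the process
table is a plain array, the names ev_wait and ev_signal are hypothetical, and
no actual context switch takes place; only the state changes are shown.

```c
#include <assert.h>
#include <stddef.h>

enum pstate { P_RUN, P_WAIT };

struct proc {
    enum pstate state;
    const void *wchan;      /* event being waited for (an address), or NULL */
};

/* Record that process p is waiting for the event chan. */
void ev_wait(struct proc *p, const void *chan)
{
    assert(p != NULL && chan != NULL);
    p->wchan = chan;
    p->state = P_WAIT;
}

/* Signal the event chan: every process waiting on it becomes runnable.
   Nothing is remembered if no process is waiting.
   Returns the number of processes awakened. */
int ev_signal(struct proc *table, size_t n, const void *chan)
{
    int woken = 0;

    assert(table != NULL && chan != NULL);
    for (size_t i = 0; i < n; i++) {
        if (table[i].state == P_WAIT && table[i].wchan == chan) {
            table[i].state = P_RUN;
            table[i].wchan = NULL;
            woken++;
        }
    }
    return woken;
}
```

Since an event is nothing more than an address compared for equality, no
table of events has to be allocated, and a signal sent before anyone waits is
simply lost.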

On the negative side, because there is no memory associated with events, no notion of 
“how much” can be signaled via the event mechanism. For example, processes that want 
memory might wait on an event associated with memory allocation. When any amount of 
memory becomes available, the event would be signaled. All the competing processes would 
then wake up to fight over the new memory. (In reality, the swapping process is the only process
that waits for primary memory to become available.)

If an event occurs between the time a process decides to wait for that event and the time 
that process enters the wait state, then the process will wait on an event that has already happened
(and may never happen again). This race condition happens because there is no
memory associated with the event to indicate that the event has occurred; the only action of an 
event is to change a set of processes from wait state to run state. This problem is relieved 
largely by the fact that process switching can only occur in the kernel by explicit calls to the 
event-wait mechanism. If the event in question is signaled by another process, then there is no 
problem. But if the event is signaled by a hardware interrupt, then special care must be taken. 
These synchronization races pose the biggest problem when UNIX is adapted to
multiple-processor configurations.³

The event-wait code in the kernel is like a co-routine linkage. At any time, all but one of 
the processes has called event-wait. The remaining process is the one currently executing. 
When it calls event-wait, a process whose event has been signaled is selected and that process 
returns from its call to event-wait. 

Which of the runnable processes is to run next? Associated with each process is a priority.
The priority of a system process is assigned by the code issuing the wait on an event. This is 
roughly equivalent to the response that one would expect on such an event. Disk events have 
high priority, teletype events are low, and time-of-day events are very low. (From observation, 
the difference in system process priorities has little or no performance impact.) All
user-process priorities are lower than the lowest system priority. User-process priorities are
assigned by an algorithm based on the recent ratio of the amount of compute time to real time 
consumed by the process. A process that has used a lot of compute time in the last real-time 
unit is assigned a low user priority. Because interactive processes are characterized by low 
ratios of compute to real time, interactive response is maintained without any special
arrangements.

The scheduling algorithm simply picks the process with the highest priority, thus picking 
all system processes first and user processes second. The compute-to-real-time ratio is updated 
every second. Thus, all other things being equal, looping user processes will be scheduled 
round-robin with a 1-second quantum. A high-priority process waking up will preempt a running,
low-priority process. The scheduling algorithm has a very desirable negative feedback
character. If a process uses its high priority to hog the computer, its priority will drop. At the 
same time, if a low-priority process is ignored for a long time, its priority will rise. 
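
The negative feedback can be illustrated with a small C model. The text does
not give the actual priority formula, so everything numeric here is an
assumption: a hypothetical clock of TICKS_PER_SEC ticks per second, a user
priority equal to the ticks left unused in the last second, and an offset
SYSTEM_BASE that places every system priority above every user priority.
Larger numbers mean higher priority in this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TICKS_PER_SEC 60        /* hypothetical clock rate */
#define SYSTEM_BASE   1000      /* hypothetical offset: system above user */

struct sched_proc {
    int runnable;               /* nonzero if ready to run */
    int in_system;              /* nonzero if in its system phase */
    int sys_pri;                /* set by the code that waited on an event */
    int cpu_ticks;              /* compute time used in the last second */
};

/* Larger value = runs sooner. Heavy recent computing lowers a user
   process's priority; being ignored (few ticks) raises it. */
int effective_priority(const struct sched_proc *p)
{
    assert(p != NULL && p->sys_pri >= 0 && p->sys_pri < SYSTEM_BASE);
    assert(p->cpu_ticks >= 0 && p->cpu_ticks <= TICKS_PER_SEC);
    if (p->in_system)
        return SYSTEM_BASE + p->sys_pri;
    return TICKS_PER_SEC - p->cpu_ticks;
}

/* Pick the runnable process with the highest priority.
   Returns its index, or -1 if nothing is runnable. */
int pick_next(const struct sched_proc *tab, size_t n)
{
    int best = -1, best_pri = -1;

    for (size_t i = 0; i < n; i++) {
        if (!tab[i].runnable)
            continue;
        int pri = effective_priority(&tab[i]);
        if (pri > best_pri) {
            best_pri = pri;
            best = (int)i;
        }
    }
    return best;
}
```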

3. I/O SYSTEM 

The I/O system is broken into two completely separate systems: the block I/O system and 
the character I/O system. In retrospect, the names should have been “structured I/O” and 
“unstructured I/O,” respectively; while the term “block I/O” has some meaning, “character
I/O” is a complete misnomer.

Devices are characterized by a major device number, a minor device number, and a class 
(block or character). For each class, there is an array of entry points into the device drivers. 
The major device number is used to index the array when calling the code for a particular 
device driver. The minor device number is passed to the device driver as an argument. The 
minor number has no significance other than that attributed to it by the driver. Usually, the 
driver uses the minor number to access one of several identical physical devices. 
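
An array of entry points indexed by major device number can be sketched in C
with a table of function pointers. The sketch below is illustrative only: the
structure name devsw, the two example drivers, and the dispatch function
dev_open are hypothetical, and a real table would carry more entry points
than open and close.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*dev_fn)(int minor);

struct devsw {
    dev_fn d_open;              /* entry points into one driver */
    dev_fn d_close;
};

/* A hypothetical disk driver handling two identical drives;
   the minor number selects the drive. */
static int opens[2];
static int disk_open(int minor)
{
    if (minor < 0 || minor >= 2)
        return -1;
    opens[minor]++;
    return 0;
}
static int disk_close(int minor)
{
    return (minor >= 0 && minor < 2) ? 0 : -1;
}

/* A hypothetical driver that ignores its minor number. */
static int null_op(int minor)
{
    (void)minor;
    return 0;
}

/* Configuration table: the only link between system code and drivers. */
static const struct devsw blk_switch[] = {
    { disk_open, disk_close },  /* major 0 */
    { null_op,   null_op    },  /* major 1 */
};
#define NBLKDEV (sizeof blk_switch / sizeof blk_switch[0])

/* System-side call: index by major number, pass the minor number along. */
int dev_open(int major, int minor)
{
    if (major < 0 || (size_t)major >= NBLKDEV)
        return -1;
    assert(blk_switch[major].d_open != NULL);
    return blk_switch[major].d_open(minor);
}
```

Adding a driver then amounts to writing its entry points and adding one row
to the table; no other system code refers to the driver by name.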

The use of the array of entry points (configuration table) as the only connection between 
the system code and the device drivers is very important. Early versions of the system had a 
much less formal connection with the drivers, so that it was extremely hard to handcraft 
differently configured systems. Now it is possible to create new device drivers in an average of 
a few hours. The configuration table in most cases is created automatically by a program that 
reads the system’s parts list. 

3.1. Block I/O system 

The model block I/O device consists of randomly addressed, secondary memory blocks of 
512 bytes each. The blocks are uniformly addressed 0, 1, ... up to the size of the device. The 
block device driver has the job of emulating this model on a physical device. 

The block I/O devices are accessed through a layer of buffering software. The system 
maintains a list of buffers (typically between 10 and 70) each assigned a device name and a 
device address. This buffer pool constitutes a data cache for the block devices. On a read 
request, the cache is searched for the desired block. If the block is found, the data are made 
available to the requester without any physical I/O. If the block is not in the cache, the least 
recently used block in the cache is renamed, the correct device driver is called to fill up the 
renamed buffer, and then the data are made available. Write requests are handled in an analogous
manner. The correct buffer is found and relabeled if necessary. The write is performed
simply by marking the buffer as “dirty.” The physical I/O is then deferred until the buffer is 
renamed. 
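
The read and delayed-write paths can be modeled with a very small cache. The
following C sketch makes several simplifying assumptions: a single device
simulated by an in-memory array, only three buffers, whole-block reads and
writes, and a counter in place of a true least-recently-used list. The names
getblk, bread, and bwrite are used for illustration and are not claimed to
match the kernel's routines.

```c
#include <assert.h>
#include <string.h>

#define NBUF   3                /* hypothetical, far fewer than a real system */
#define BSIZE  512
#define NBLOCK 16               /* size of the simulated device in blocks */

struct buf {
    int  valid, dirty;
    int  blkno;                 /* device address this buffer is named for */
    unsigned long last_use;     /* larger = used more recently */
    char data[BSIZE];
};

static char disk[NBLOCK][BSIZE];        /* simulated device */
static struct buf cache[NBUF];
static unsigned long clock_tick;
int physical_reads, physical_writes;    /* counts of simulated device I/O */

/* Find the buffer for blkno, renaming the least recently used buffer
   on a miss. A dirty victim is written out only at this point. */
static struct buf *getblk(int blkno, int fill)
{
    struct buf *victim = &cache[0];

    assert(blkno >= 0 && blkno < NBLOCK);
    for (int i = 0; i < NBUF; i++) {
        struct buf *b = &cache[i];
        if (b->valid && b->blkno == blkno) {
            b->last_use = ++clock_tick;
            return b;                   /* cache hit: no physical I/O */
        }
        if (!b->valid || (victim->valid && b->last_use < victim->last_use))
            victim = b;
    }
    if (victim->valid && victim->dirty) {
        memcpy(disk[victim->blkno], victim->data, BSIZE);   /* deferred write */
        physical_writes++;
    }
    victim->valid = 1;
    victim->dirty = 0;
    victim->blkno = blkno;
    victim->last_use = ++clock_tick;
    if (fill) {
        memcpy(victim->data, disk[blkno], BSIZE);
        physical_reads++;
    }
    return victim;
}

/* Read a whole block into dst. */
void bread(int blkno, char *dst)
{
    memcpy(dst, getblk(blkno, 1)->data, BSIZE);
}

/* Write a whole block: just mark the buffer dirty. */
void bwrite(int blkno, const char *src)
{
    struct buf *b = getblk(blkno, 0);
    memcpy(b->data, src, BSIZE);
    b->dirty = 1;
}
```

Note that bwrite returns before any device I/O has happened, which is the
source of both the performance benefit and the error-reporting difficulty
discussed next.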

The benefits in reduction of physical I/O of this scheme are substantial, especially considering
the file system implementation. There are, however, some drawbacks. The asynchronous
nature of the algorithm makes error reporting and meaningful user error handling almost 
impossible. The cavalier approach to I/O error handling in the UNIX system is partly due to 
the asynchronous nature of the block I/O system. A second problem is in the delayed writes. 
If the system stops unexpectedly, it is almost certain that there is a lot of logically complete, 
but physically incomplete, I/O in the buffers. There is a system primitive to flush all outstanding
I/O activity from the buffers. Periodic use of this primitive helps, but does not solve, the
problem. Finally, the associativity in the buffers can alter the physical I/O sequence from that 
of the logical I/O sequence. This means that there are times when data structures on disk are 
inconsistent, even though the software is careful to perform I/O in the correct order. On
non-random devices, notably magnetic tape, the inversions of writes can be disastrous. The problem
with magnetic tapes is “cured” by allowing only one outstanding write request per drive.

3.2. Character I/O system 

The character I/O system consists of all devices that do not fall into the block I/O model. 
This includes the “classical” character devices such as communications lines, paper tape, and 
line printers. It also includes magnetic tape and disks when they are not used in a stereotyped 
way, for example, 80-byte physical records on tape and track-at-a-time disk copies. In short, 
the character I/O interface means “everything other than block.” I/O requests from the user 
are sent to the device driver essentially unaltered. The implementation of these requests is, of 
course, up to the device driver. There are guidelines and conventions to help the implementation
of certain types of device drivers.

3.2.1. Disk drivers

Disk drivers are implemented with a queue of transaction records. Each record holds a 
read/write flag, a primary memory address, a secondary memory address, and a transfer byte 
count. Swapping is accomplished by passing such a record to the swapping device driver. The 
block I/O interface is implemented by passing such records with requests to fill and empty system
buffers. The character I/O interface to the disk drivers creates a transaction record that
points directly into the user area. The routine that creates this record also ensures that the
user is not swapped during this I/O transaction. Thus by implementing the general disk 
driver, it is possible to use the disk as a block device, a character device, and a swap device. 
The only really disk-specific code in normal disk drivers is the pre-sort of transactions to 
minimize latency for a particular device, and the actual issuing of the I/O request. 

3.2.2. Character lists 

Real character-oriented devices may be implemented using the common code to handle 
character lists. A character list is a queue of characters. One routine puts a character on a 
queue. Another gets a character from a queue. It is also possible to ask how many characters 
are currently on a queue. Storage for all queues in the system comes from a single common 
pool. Putting a character on a queue will allocate space from the common pool and link the 
character onto the data structure defining the queue. Getting a character from a queue returns 
the corresponding space to the pool. 
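
A character list drawing its storage from a common pool can be sketched as
follows. For brevity this sketch stores one character per pool cell, which is
an assumption made here and probably less compact than a real
implementation; the pool size NCELL and the names cl_put and cl_get are
likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NCELL 8                 /* hypothetical size of the common pool */

struct cell {
    struct cell *next;
    char ch;
};

struct clist {
    struct cell *head, *tail;   /* queue of characters */
    int count;                  /* how many characters are on the queue */
};

static struct cell pool[NCELL];
static struct cell *freelist;
static int pool_ready;

/* Thread all cells onto the free list. */
static void pool_init(void)
{
    for (int i = 0; i < NCELL - 1; i++)
        pool[i].next = &pool[i + 1];
    pool[NCELL - 1].next = NULL;
    freelist = pool;
    pool_ready = 1;
}

/* Put a character on a queue. Returns 0, or -1 if the pool is empty. */
int cl_put(struct clist *q, char c)
{
    assert(q != NULL);
    if (!pool_ready)
        pool_init();
    struct cell *cp = freelist;
    if (cp == NULL)
        return -1;
    freelist = cp->next;
    cp->ch = c;
    cp->next = NULL;
    if (q->tail != NULL)
        q->tail->next = cp;
    else
        q->head = cp;
    q->tail = cp;
    q->count++;
    return 0;
}

/* Get a character from a queue, returning its cell to the pool.
   Returns the character, or -1 if the queue is empty. */
int cl_get(struct clist *q)
{
    assert(q != NULL);
    struct cell *cp = q->head;
    if (cp == NULL)
        return -1;
    q->head = cp->next;
    if (q->head == NULL)
        q->tail = NULL;
    q->count--;
    int c = (unsigned char)cp->ch;
    cp->next = freelist;
    freelist = cp;
    return c;
}
```

Because every queue shares the one pool, a busy device can exhaust storage
for all the others, which is one reason drivers bound the number of
characters they allow on a queue.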

A typical character-output device (paper tape punch, for example) is implemented by 
passing characters from the user onto a character queue until some maximum number of characters
is on the queue. The I/O is prodded to start as soon as there is anything on the queue
and, once started, it is sustained by hardware completion interrupts. Each time there is a completion
interrupt, the driver gets the next character from the queue and sends it to the
hardware. The number of characters on the queue is checked and, as the count falls through 
some intermediate level, an event (the queue address) is signaled. The process that is passing 
characters from the user to the queue can be waiting on the event, and refill the queue to its 
maximum when the event occurs. 

A typical character input device (for example, a paper tape reader) is handled in a very 
similar manner. 

Another class of character devices is the terminals. A terminal is represented by three 
character queues. There are two input queues (raw and canonical) and an output queue. 
Characters going to the output of a terminal are handled by common code exactly as described 
above. The main difference is that there is also code to interpret the output stream as ASCII 
characters and to perform some translations, e.g., escapes for deficient terminals. Another 
common aspect of terminals is code to insert real-time delay after certain control characters. 

Input on terminals is a little different. Characters are collected from the terminal and 
placed on a raw input queue. Some device-dependent code conversion and escape interpretation
is handled here. When a line is complete in the raw queue, an event is signaled. The code
catching this signal then copies a line from the raw queue to a canonical queue performing the 
character erase and line kill editing. User read requests on terminals can be directed at either 
the raw or canonical queues. 

3.2.3. Other character devices 

Finally, there are devices that fit no general category. These devices are set up as
character I/O drivers. An example is a driver that reads and writes unmapped primary memory as an
I/O device. Some devices are too fast to be treated a character at a time, but do not fit the disk
I/O mold. Examples are fast communications lines and fast line printers. These devices either 
have their own buffers or “borrow” block I/O buffers for a while and then give them back.

4. THE FILE SYSTEM

In the UNIX system, a file is a (one-dimensional) array of bytes. No other structure of 
files is implied by the system. Files are attached anywhere (and possibly multiply) onto a 
hierarchy of directories. Directories are simply files that users cannot write. For a further
discussion of the external view of files and directories, see Ref. 4.

The UNIX file system is a disk data structure accessed completely through the block I/O 
system. As stated before, the canonical view of a “disk” is a randomly addressable array of 
512-byte blocks. A file system breaks the disk into four self-identifying regions. The first 
block (address 0) is unused by the file system. It is left aside for booting procedures. The 
second block (address 1) contains the so-called “super-block.” This block, among other things, 
contains the size of the disk and the boundaries of the other regions. Next comes the i-list, a 
list of file definitions. Each file definition is a 64-byte structure, called an i-node. The offset of 
a particular i-node within the i-list is called its i-number. The combination of device name 
(major and minor numbers) and i-number serves to uniquely name a particular file. After the 
i-list, and to the end of the disk, come free storage blocks that are available for the contents of 
files. 

The free space on a disk is maintained by a linked list of available disk blocks. Every 
block in this chain contains a disk address of the next block in the chain. The remaining space 
contains the address of up to 50 disk blocks that are also free. Thus with one I/O operation, 
the system obtains 50 free blocks and a pointer where to find more. The disk allocation
algorithms are very straightforward. Since all allocation is in fixed-size blocks and there is strict
accounting of space, there is no need to compact or garbage collect. However, as disk space
becomes dispersed, latency gradually increases. Some installations choose to occasionally
compact disk space to reduce latency.
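
The chained free list described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the "disk" is an in-memory array, the names blk_init, blk_free and blk_alloc are hypothetical, and the convention that entry 0 of each list is the link to the next chain block is one plausible layout, not necessarily the one on disk.

```c
#include <string.h>

#define NICFREE 50      /* addresses held per chain block, as in the text */
#define NBLOCKS 512     /* size of the toy in-memory disk (assumption) */

/* What a chain block stores: a count and up to NICFREE block addresses.
 * Entry 0 is the link to the next chain block (0 means end of chain). */
struct freelist {
    int nfree;
    int addr[NICFREE];
};

static struct freelist disk[NBLOCKS];  /* toy disk; no bounds checks on addresses */
static struct freelist head;           /* in-core head of the free list */

/* Return block bno to the free list. */
void blk_free(int bno)
{
    if (head.nfree == 0)               /* empty list: plant the end-of-chain marker */
        head.addr[head.nfree++] = 0;
    if (head.nfree == NICFREE) {       /* in-core list full: spill it into bno */
        disk[bno] = head;              /* one "write" */
        head.nfree = 0;
    }
    head.addr[head.nfree++] = bno;     /* after a spill, bno becomes the link */
}

/* Take a block from the free list; returns 0 when no space is left. */
int blk_alloc(void)
{
    int bno;

    if (head.nfree == 0)
        return 0;
    bno = head.addr[--head.nfree];
    if (bno == 0)                      /* reached the end-of-chain marker */
        return 0;
    if (head.nfree == 0)               /* bno was the link: reload the list from it */
        head = disk[bno];              /* one "read" yields up to 50 more addresses */
    return bno;
}

/* Build a free list from blocks first..last (inclusive). */
void blk_init(int first, int last)
{
    int b;

    memset(&head, 0, sizeof head);
    for (b = first; b <= last; b++)
        blk_free(b);
}
```

Note that allocation touches the disk only when the in-core list runs dry, which is the "one I/O operation per 50 blocks" property the text describes.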

An i-node contains 13 disk addresses. The first 10 of these addresses point directly at the 
first 10 blocks of a file. If a file is larger than 10 blocks (5,120 bytes), then the eleventh 
address points at a block that contains the addresses of the next 128 blocks of the file. If the 
file is still larger than this (70,656 bytes), then the twelfth address points at a block of up to
128 addresses, each pointing to a block that addresses 128 blocks of the file. Files yet larger
(8,459,264 bytes) use the thirteenth address for a “triple indirect” address. The algorithm ends
here with the maximum file size of 1,082,201,087 bytes.
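
The size thresholds quoted above follow directly from the block size and the fan-out of an indirect block. The short C function below reproduces the arithmetic; the function name max_bytes and the macro names are illustrative only.

```c
#define BSIZE   512L    /* bytes per block */
#define NDIRECT 10L     /* direct addresses in the i-node */
#define NINDIR  128L    /* block addresses held by one indirect block */

/* Bytes reachable with the direct addresses plus `levels` levels of
 * indirection (0 = direct only, 3 = through triple indirect). */
long max_bytes(int levels)
{
    long blocks = NDIRECT;
    long span = 1;
    int i;

    for (i = 0; i < levels; i++) {
        span *= NINDIR;     /* 128, then 128*128, then 128*128*128 */
        blocks += span;
    }
    return blocks * BSIZE;
}
```

With these constants, max_bytes(0), max_bytes(1) and max_bytes(2) give 5,120, 70,656 and 8,459,264 bytes. max_bytes(3) gives 1,082,201,088, which is one more than the figure in the text; the text's number is presumably the largest byte offset rather than the byte count.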

A logical directory hierarchy is added to this flat physical structure simply by adding a 
new type of file, the directory. A directory is accessed exactly as an ordinary file. It contains 
16-byte entries consisting of a 14-byte name and an i-number. The root of the hierarchy is at 
a known i-number (viz., 2). The file system structure allows an arbitrary, directed graph of 
directories with regular files linked in at arbitrary places in this graph. In fact, very early 
UNIX systems used such a structure. Administration of such a structure became so chaotic 
that later systems were restricted to a directory tree. Even now, with regular files linked
multiply into arbitrary places in the tree, accounting for space has become a problem. It may
become necessary to restrict the entire structure to a tree, and allow a new form of linking that 
is subservient to the tree structure. 

The file system allows easy creation, easy removal, easy random accessing, and very easy 
space allocation. With most physical addresses confined to a small contiguous section of disk, 
it is also easy to dump, restore, and check the consistency of the file system. Large files suffer 
from indirect addressing, but the cache prevents most of the implied physical I/O without
adding much execution time. The space overhead properties of this scheme are quite good. For
example, on one particular file system, there are 25,000 files containing 130M bytes of data-file 
content. The overhead (i-node, indirect blocks, and last block breakage) is about 11.5M bytes. 
The directory structure to support these files has about 1,500 directories containing 0.6M bytes 
of directory content and about 0.5M bytes of overhead in accessing the directories. Added up 
any way, this comes out to less than a 10 percent overhead for actual stored data. Most
systems have this much overhead in padded trailing blanks alone.





4.1. File system implementation 

Because the i-node defines a file, the implementation of the file system centers around 
access to the i-node. The system maintains a table of all active i-nodes. As a new file is 
accessed, the system locates the corresponding i-node, allocates an i-node table entry, and 
reads the i-node into primary memory. As in the buffer cache, the table entry is considered to 
be the current version of the i-node. Modifications to the i-node are made to the table entry. 
When the last access to the i-node goes away, the table entry is copied back to the secondary 
store i-list and the table entry is freed. 

All I/O operations on files are carried out with the aid of the corresponding i-node table 
entry. The accessing of a file is a straightforward implementation of the algorithms mentioned 
previously. The user is not aware of i-nodes and i-numbers. References to the file system are 
made in terms of path names of the directory tree. Converting a path name into an i-node 
table entry is also straightforward. Starting at some known i-node (the root or the current 
directory of some process), the next component of the path name is searched by reading the 
directory. This gives an i-number and an implied device (that of the directory). Thus the next 
i-node table entry can be accessed. If that was the last component of the path name, then this 
i-node is the result. If not, this i-node is the directory needed to look up the next component 
of the path name, and the algorithm is repeated. 
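
The lookup loop just described, together with the 16-byte directory entry (a 14-byte name plus an i-number), can be sketched as follows. The i-list here is a small in-memory array and directories hold a fixed number of entries; those simplifications, and the helper dir_enter, are assumptions made for the sake of a compact example. The device component of the result is omitted.

```c
#include <string.h>

#define DIRSIZ  14      /* bytes of name in a directory entry */
#define ROOTINO 2       /* i-number of the root, as given in the text */
#define NINODE  16      /* toy i-list size (assumption) */
#define NENT    8       /* entries per toy directory (assumption) */

/* A 16-byte directory entry: 2-byte i-number plus a 14-byte name. */
struct direct {
    unsigned short d_ino;       /* 0 marks an empty slot */
    char d_name[DIRSIZ];        /* NUL-padded, not always NUL-terminated */
};

/* Toy i-node: a directory holding entries, or an ordinary file. */
struct inode {
    int is_dir;
    struct direct ent[NENT];
};

static struct inode ilist[NINODE];

/* Search one directory for a single component; returns i-number or 0. */
static int dir_search(const struct inode *dp, const char *name, size_t len)
{
    char key[DIRSIZ];
    int i;

    if (len > DIRSIZ)
        len = DIRSIZ;                   /* names are truncated to 14 bytes */
    memset(key, 0, sizeof key);
    memcpy(key, name, len);
    for (i = 0; i < NENT; i++)
        if (dp->ent[i].d_ino != 0 &&
            memcmp(dp->ent[i].d_name, key, DIRSIZ) == 0)
            return dp->ent[i].d_ino;
    return 0;
}

/* Convert a path name to an i-number; relative paths start at `start`. */
int namei(const char *path, int start)
{
    int ino = (*path == '/') ? ROOTINO : start;

    while (*path) {
        size_t len;

        while (*path == '/')            /* skip separators */
            path++;
        if (*path == '\0')
            break;
        len = strcspn(path, "/");
        if (!ilist[ino].is_dir)         /* cannot descend through a file */
            return 0;
        ino = dir_search(&ilist[ino], path, len);
        if (ino == 0)                   /* component not found */
            return 0;
        path += len;
    }
    return ino;
}

/* Helper for building the toy tree: add name -> ino to directory dir. */
int dir_enter(int dir, const char *name, int ino)
{
    int i;

    for (i = 0; i < NENT; i++) {
        struct direct *d = &ilist[dir].ent[i];
        if (d->d_ino == 0) {
            d->d_ino = (unsigned short)ino;
            strncpy(d->d_name, name, DIRSIZ);   /* pads the name with NULs */
            return 0;
        }
    }
    return -1;
}
```

After marking i-nodes 2 and 3 as directories and entering "usr" in 2 and "notes" in 3, namei("/usr/notes", 2) walks root, then usr, and returns the i-number stored for notes.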

The user process accesses the file system with certain primitives. The most common of 
these are open, create, read, write, seek, and close. The data structures maintained are 
shown in Fig. 2. 


Fig. 2—File system data structure. 

In the system data segment associated with a user, there is room for some (usually between 10 
and 50) open files. This open file table consists of pointers that can be used to access 
corresponding i-node table entries. Associated with each of these open files is a current I/O 
pointer. This is a byte offset of the next read/write operation on the file. The system treats 
each read/write request as random with an implied seek to the I/O pointer. The user usually 
thinks of the file as sequential with the I/O pointer automatically counting the number of bytes 
that have been read/written from the file. The user may, of course, perform random I/O by 
setting the I/O pointer before reads/writes.

With file sharing, it is necessary to allow related processes to share a common I/O pointer
and yet have separate I/O pointers for independent processes that access the same file. With 
these two conditions, the I/O pointer cannot reside in the i-node table nor can it reside in the 
list of open files for the process. A new table (the open file table) was invented for the sole 
purpose of holding the I/O pointer. Processes that share the same open file (the result of 
forks) share a common open file table entry. A separate open of the same file will only share 
the i-node table entry, but will have distinct open file table entries. 
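
The difference between a shared and a separate I/O pointer is easy to observe from user level. The sketch below uses present-day POSIX calls (mkstemp, lseek, waitpid), which are not the calls of the system described here; the function name and scratch file path are hypothetical. A child reads four bytes through an inherited descriptor, after which the parent finds that its own pointer on that descriptor has moved, while a descriptor obtained by a separate open has not.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if the I/O pointers behave as the text describes, -1 otherwise. */
int shared_offset_demo(void)
{
    char path[] = "/tmp/ioptrXXXXXX";   /* scratch file template (assumption) */
    char buf[4];
    int fd, fd2, status, ok;
    pid_t pid;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (write(fd, "abcdefgh", 8) != 8)
        return -1;
    lseek(fd, 0, SEEK_SET);

    fd2 = open(path, O_RDONLY);         /* separate open: its own I/O pointer */
    unlink(path);
    if (fd2 < 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)                       /* child shares fd's open file table entry */
        _exit(read(fd, buf, 4) == 4 ? 0 : 1);
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    ok = WIFEXITED(status) && WEXITSTATUS(status) == 0
        && lseek(fd, 0, SEEK_CUR) == 4      /* moved by the child's read */
        && lseek(fd2, 0, SEEK_CUR) == 0;    /* independent pointer untouched */
    close(fd);
    close(fd2);
    return ok ? 0 : -1;
}
```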

The main file system primitives are implemented as follows. open converts a file system
path name into an i-node table entry. A pointer to the i-node table entry is placed in a newly
created open file table entry. A pointer to the file table entry is placed in the system data
segment for the process. create first creates a new i-node entry, writes the i-number into a
directory, and then builds the same structure as for an open. read and write just access the
i-node entry as described above. seek simply manipulates the I/O pointer; no physical seeking
is done. close just frees the structures built by open and create. Reference counts are kept
on the open file table entries and the i-node table entries to free these structures after the last
reference goes away. unlink simply decrements the count of the number of directories
pointing at the given i-node. When the last reference to an i-node table entry goes away, if the
i-node has no directories pointing to it, then the file is removed and the i-node is freed. This
delayed removal of files prevents problems arising from removing active files. A file may be
removed while still open. The resulting unnamed file vanishes when the file is closed. This is
a method of obtaining temporary files.
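
The temporary-file idiom mentioned above looks like this with present-day POSIX calls. The function names and the path template are illustrative assumptions; mkstemp is used only to pick a unique name.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a scratch file and remove its name at once.
 * Returns an open descriptor, or -1 on failure. */
int unnamed_tmpfile(void)
{
    char path[] = "/tmp/scratchXXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    if (unlink(path) < 0) {     /* no directory points at the i-node now */
        close(fd);
        return -1;
    }
    return fd;                  /* storage is released when fd is closed */
}

/* Number of directory entries still naming the open file, or -1. */
long link_count(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? (long)st.st_nlink : -1L;
}
```

The descriptor remains fully usable for reading and writing after the unlink; on typical systems link_count reports 0 for it.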

There is a type of unnamed FIFO file called a pipe. Implementation of pipes consists of 
implied seeks before each read or write in order to implement first-in-first-out. There are 
also checks and synchronization to prevent the writer from grossly outproducing the reader 
and to prevent the reader from overtaking the writer. 
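
From the user's side, a pipe is simply a pair of descriptors. In the sketch below (the function name pipe_roundtrip is hypothetical, and the calls are the POSIX forms) a child writes a message and the parent reads it back in the order written; the read blocks until data arrives, which is the synchronization the text refers to.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child writes msg into a pipe; parent reads it into out (capacity outlen).
 * Returns the number of bytes read, or -1 on failure. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outlen)
{
    int pfd[2];
    pid_t pid;
    ssize_t n, total = 0;

    if (pipe(pfd) < 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                     /* child: the writer */
        close(pfd[0]);
        _exit(write(pfd[1], msg, strlen(msg)) < 0);
    }
    close(pfd[1]);                      /* so read sees end-of-file at child exit */
    while ((size_t)total < outlen &&
           (n = read(pfd[0], out + total, outlen - (size_t)total)) > 0)
        total += n;
    close(pfd[0]);
    waitpid(pid, NULL, 0);
    return total;
}
```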

4.2. Mounted file systems 

The file system of a UNIX system starts with some designated block device formatted as 
described above to contain a hierarchy. The root of this structure is the root of the UNIX file 
system. A second formatted block device may be mounted at any leaf of the current hierarchy. 
This logically extends the current hierarchy. The implementation of mounting is trivial. A 
mount table is maintained containing pairs of designated leaf i-nodes and block devices. When 
converting a path name into an i-node, a check is made to see if the new i-node is a designated 
leaf. If it is, the i-node of the root of the block device replaces it. 
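
The mount-table check is small enough to show in full. The following is a sketch under assumptions: the table size, the structure layout, and the names mount_dev and cross_mount are invented for illustration, and devices are represented by plain integers.

```c
#define NMOUNT  4       /* toy mount table size (assumption) */
#define ROOTINO 2       /* i-number of the root of any file system */

/* One pair: a designated leaf i-node and the block device mounted on it. */
struct mount {
    int in_use;
    int on_dev, on_ino;     /* the leaf, named by (device, i-number) */
    int dev;                /* the device whose root replaces the leaf */
};

static struct mount mtab[NMOUNT];

/* Record that dev is mounted on leaf (on_dev, on_ino); 0 on success. */
int mount_dev(int on_dev, int on_ino, int dev)
{
    int i;

    for (i = 0; i < NMOUNT; i++)
        if (!mtab[i].in_use) {
            mtab[i].in_use = 1;
            mtab[i].on_dev = on_dev;
            mtab[i].on_ino = on_ino;
            mtab[i].dev = dev;
            return 0;
        }
    return -1;              /* table full */
}

/* Called after each path name step: if (*dev, *ino) is a designated
 * leaf, replace it with the root of the mounted device. */
void cross_mount(int *dev, int *ino)
{
    int i;

    for (i = 0; i < NMOUNT; i++)
        if (mtab[i].in_use && mtab[i].on_dev == *dev && mtab[i].on_ino == *ino) {
            *dev = mtab[i].dev;
            *ino = ROOTINO;
            return;
        }
}
```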

Allocation of space for a file is taken from the free pool on the device on which the file 
lives. Thus a file system consisting of many mounted devices does not have a common pool of 
free secondary storage space. This separation of space on different devices is necessary to 
allow easy unmounting of a device. 

4.3. Other system functions 

There are some other things that the system does for the user: a little accounting, a little
tracing/debugging, and a little access protection. Most of these things are not very well 
developed because our use of the system in computing science research does not need them. 
There are some features that are missed in some applications, for example, better inter-process 
communication. 

The UNIX kernel is an I/O multiplexer more than a complete operating system. This is 
as it should be. Because of this outlook, many features are found in most other operating
systems that are missing from the UNIX kernel. For example, the UNIX kernel does not support
file access methods, file disposition, file formats, file maximum size, spooling, command 
language, logical records, physical records, assignment of logical file names, logical file names, 
more than one character set, an operator’s console, an operator, log-in, or log-out. Many of 
these things are symptoms rather than features. Many of these things are implemented in user 
software using the kernel as a tool. A good example of this is the command language (Ref. 4). Each
user may have his own command language. Maintenance of such code is as easy as
maintaining user code. The idea of implementing “system” code with general user primitives comes
directly from MULTICS (Ref. 5).

ABSTRACT

This document summarizes the facilities provided by the 4.2BSD version 
of the UNIX operating system. It does not attempt to act as a tutorial for use 
of the system nor does it attempt to explain or justify the design of the system 
facilities. It gives neither motivation nor implementation details, in favor of 
brevity. 

The first section describes the basic kernel functions provided to a UNIX 
process: process naming and protection, memory management, software
interrupts, object references (descriptors), time and statistics functions, and
resource controls. These facilities, as well as facilities for bootstrap, shutdown 
and process accounting, are provided solely by the kernel. 

The second section describes the standard system abstractions for files 
and file systems, communication, terminal handling, and process control and 
debugging. These facilities are implemented by the operating system or by
network server processes.


* UNIX is a trademark of Bell Laboratories.

.3. Non-blocking and asynchronous operations

2.2. File system 

.1. Overview
.2. Naming 

.3. Creation and removal 

.3.1. Directory creation and removal 

.3.2. File creation 

.3.3. Creating references to devices 

.3.4. Portal creation 

.3.6. File, device, and portal removal 

.4. Reading and modifying file attributes 

.5. Links and renaming 

.6. Extension and truncation 

.7. Checking accessibility 

.8. Locking 

.9. Disc quotas 

2.3. Interprocess communication

.1. Interprocess communication primitives 
.1.1. Communication domains 

.1.2. Socket types and protocols 

.1.3. Socket creation, naming and service establishment 

.1.4. Accepting connections 

.1.5. Making connections 

.1.6. Sending and receiving data 

.1.7. Scatter/gather and exchanging access rights 

.1.8. Using read and write with sockets 

.1.9. Shutting down halves of full-duplex connections 

.1.10. Socket and protocol options 

.2. UNIX domain 

.2.1. Types of sockets 

.2.2. Naming 

.2.3. Access rights transmission 

.3. INTERNET domain 

.3.1. Socket types and protocols 

.3.2. Socket naming 

.3.3. Access rights transmission 

.3.4. Raw access 

2.4. Terminals and devices 

.1. Terminals 

.1.1. Terminal input 

.1.1.1 Input modes 

.1.1.2 Interrupt characters 

.1.1.3 Line editing 

.1.2. Terminal output 

.1.3. Terminal control operations 

.1.4. Terminal hardware support 

.2. Structured devices 

.3. Unstructured devices

2.5. Process control and debugging

I. Summary of facilities

0. Notation and types


The notation used to describe system calls is a variant of a C language call, consisting of 
a prototype call followed by declaration of parameters and results. An additional keyword 
result, not part of the normal C language, is used to indicate which of the declared entities 
receive results. As an example, consider the read call, as described in section 2.1: 


cc = read(fd, buf, nbytes); 

result int cc; int fd; result char *buf; int nbytes; 

The first line shows how the read routine is called, with three parameters. As shown on the 
second line cc is an integer and read also returns information in the parameter buf. 

Description of all error conditions arising from each system call is not provided here; they 
appear in the programmer’s manual. In particular, when accessed from the C language, many 
calls return a characteristic -1 value when an error occurs, returning the error code in the
global variable errno. Other languages may present errors in different ways.
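
As a small illustration of the -1 and errno convention, the wrapper below (its name read_or_errno is hypothetical) calls read and captures the error code at once, since a later call may overwrite errno.

```c
#include <errno.h>
#include <unistd.h>

/* Read from fd; on failure return -1 and store the error code in *err. */
int read_or_errno(int fd, char *buf, int nbytes, int *err)
{
    int cc = (int)read(fd, buf, (size_t)nbytes);

    if (cc == -1)
        *err = errno;   /* save immediately: later calls may change errno */
    return cc;
}
```

Passing a descriptor that is not open, such as -1, yields a return of -1 with the code EBADF.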


A number of system standard types are defined in the include file <sys/types.h> and used
in the specifications here and in many C programs. These include caddr_t giving a memory
address (typically as a character pointer), off_t giving a file offset (typically as a long integer),
and a set of unsigned types u_char, u_short, u_int and u_long, shorthand names for
unsigned char, unsigned short, etc.



1. Kernel primitives


The facilities available to a UNIX user process are logically divided into two parts: kernel 
facilities directly implemented by UNIX code running in the operating system, and system 
facilities implemented either by the system, or in cooperation with a server process. The
kernel facilities are described in this section.


The facilities implemented in the kernel are those which define the UNIX virtual
machine in which each process runs. Like many real machines, this virtual machine has
memory management hardware, an interrupt facility, timers and counters. The UNIX virtual 
machine also allows access to files and other objects through a set of descriptors. Each 
descriptor resembles a device controller, and supports a set of operations. Like devices on real 
machines, some of which are internal to the machine and some of which are external, parts of 
the descriptor machinery are built-in to the operating system, while other parts are often 
implemented in server processes on other machines. The facilities provided through the 
descriptor machinery are described in section 2.

1.1. Processes and protection


1.1.1. Host and process identifiers 

Each UNIX host has associated with it a 32-bit host id, and a host name of up to 255 
characters. These are set (by a privileged user) and returned by the calls: 

sethostid(hostid) 
long hostid; 

hostid = gethostid();
result long hostid; 

sethostname(name, len) 
char *name; int len; 

len = gethostname(buf, buflen) 
result int len; result char *buf; int buflen; 

On each host runs a set of processes. Each process is largely independent of other processes, 
having its own protection domain, address space, timers, and an independent set of references 
to system or user implemented objects. 

Each process in a host is named by an integer called the process id. This number is in 
the range 1-30000 and is returned by the getpid routine: 

pid = getpid(); 
result int pid; 

On each UNIX host this identifier is guaranteed to be unique; in a multi-host environment, the 
(hostid, process id) pairs are guaranteed unique. 

1.1.2. Process creation and termination 

A new process is created by making a logical duplicate of an existing process: 

pid = fork(); 
result int pid; 

The fork call returns twice, once in the parent process, where pid is the process identifier of 
the child, and once in the child process where pid is 0. The parent-child relationship induces a 
hierarchical structure on the set of processes in the system. 

A process may terminate by executing an exit call: 

exit(status) 
int status; 

returning 8 bits of exit status to its parent. 

When a child process exits or terminates abnormally, the parent process receives
information about any event which caused termination of the child process. A second call, wait3,
provides a non-blocking interface and may also be used to retrieve information about resources
consumed by the process during its lifetime.




#include <sys/wait.h>

pid = wait(astatus);
result int pid; result union wait *astatus;

pid = wait3(astatus, options, arusage);
result int pid; result union wait *astatus;
int options; result struct rusage *arusage;
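
The interplay of fork, exit and wait can be tried with the short function below. It is a sketch using the present-day POSIX interface (waitpid with an int status and the WIFEXITED and WEXITSTATUS macros) rather than the union wait form shown above; the function name is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `status`; return the exit status
 * the parent observes, or -1 on failure. */
int child_exit_status(int status)
{
    int st;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(status);                  /* child: pid is 0 here */
    if (waitpid(pid, &st, 0) != pid)    /* parent: pid is the child's id */
        return -1;
    return WIFEXITED(st) ? WEXITSTATUS(st) : -1;
}
```

Because only 8 bits of status reach the parent, child_exit_status(300) reports 44 (300 modulo 256).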

A process can overlay itself with the memory image of another process, passing the newly 
created process a set of parameters, using the call: 

execve(name, argv, envp) 
char *name, **argv, **envp; 

The specified name must be a file which is in a format recognized by the system, either a 
binary executable file or a file which causes the execution of a specified interpreter program to 
process its contents. 

1.1.3. User and group ids 

Each process in the system has associated with it two user-id’s: a real user id and an
effective user id, both non-negative 16 bit integers. Each process has a real accounting group
id and an effective accounting group id and a set of access group id’s. The group id’s are
non-negative 16 bit integers. Each process may be in several different access groups, with the
maximum concurrent number of access groups a system compilation parameter, the constant
NGROUPS in the file <sys/param.h>, guaranteed to be at least 8.

The real and effective user ids associated with a process are returned by: 

ruid = getuid();
result int ruid;

euid = geteuid();
result int euid;

the real and effective accounting group ids by:

rgid = getgid();
result int rgid;

egid = getegid();
result int egid;

and the access group id set is returned by a getgroups call: 

ngroups = getgroups(gidsetsize, gidset);
result int ngroups; int gidsetsize; result int gidset[gidsetsize];

The user and group id’s are assigned at login time using the setreuid, setregid, and
setgroups calls:

setreuid(ruid, euid);
int ruid, euid; 


setregid(rgid, egid); 
int rgid, egid; 


setgroups(gidsetsize, gidset)
int gidsetsize; int gidset[gidsetsize];


The setreuid call sets both the real and effective user-id’s, while the setregid call sets both the 
real and effective accounting group id’s. Unless the caller is the super-user, ruid must be equal 
to either the current real or effective user-id, and rgid equal to either the current real or 
effective accounting group id. The setgroups call is restricted to the super-user. 
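
A program can consult its group identities with the calls above. The sketch below uses the POSIX forms of getegid and getgroups (with gid_t rather than int); the function name in_access_groups is hypothetical. It reports whether a given group id is the effective group id or a member of the access group set.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if gid is the effective group id or is in the access
 * group set, 0 if not, and -1 on error. */
int in_access_groups(gid_t gid)
{
    gid_t *set;
    int i, n, found = 0;

    if (gid == getegid())
        return 1;
    n = getgroups(0, NULL);         /* size 0 asks only for the count */
    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    set = malloc((size_t)n * sizeof *set);
    if (set == NULL)
        return -1;
    n = getgroups(n, set);
    for (i = 0; i < n; i++)
        if (set[i] == gid)
            found = 1;
    free(set);
    return n < 0 ? -1 : found;
}
```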


1.1.4. Process groups 


Each process in the system is also normally associated with a process group. The group 
of processes in a process group is sometimes referred to as a job and manipulated by high-level 
system software (such as the shell). The current process group of a process is returned by the 
getpgrp call: 


pgrp = getpgrp(pid);
result int pgrp; int pid; 




When a process is in a specific process group it may receive software interrupts affecting the 
group, causing the group to suspend or resume execution or to be interrupted or terminated. 
In particular, a system terminal has a process group and only processes which are in the
process group of the terminal may read from the terminal, allowing arbitration of terminals among
several different jobs. 


The process group associated with a process may be changed by the setpgrp call: 


setpgrp(pid, pgrp); 
int pid, pgrp; 


Newly created processes are assigned process id’s distinct from all processes and process 
groups, and the same process group as their parent. A normal (unprivileged) process may set 
its process group equal to its process id. A privileged process may set the process group of any 
process to any value. 
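
The rule that a new process inherits its parent's process group, and may then set its group equal to its own process id, can be checked as follows. This is a sketch with the POSIX calls getpgrp() (which takes no argument there) and setpgid, not the two-argument forms described above; the function name is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if a child starts in its parent's process group and can
 * then make itself the leader of a new group; -1 otherwise. */
int new_group_demo(void)
{
    int st;
    pid_t parent_pgrp = getpgrp();
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        int ok = getpgrp() == parent_pgrp   /* inherited from the parent */
              && setpgid(0, 0) == 0         /* group id := own process id */
              && getpgrp() == getpid();
        _exit(ok ? 0 : 1);
    }
    if (waitpid(pid, &st, 0) != pid)
        return -1;
    return (WIFEXITED(st) && WEXITSTATUS(st) == 0) ? 0 : -1;
}
```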
1.2. Memory management†


1.2.1. Text, data and stack 

Each process begins execution with three logical areas of memory called text, data and 
stack. The text area is read-only and shared, while the data and stack areas are private to the 
process. Both the data and stack areas may be extended and contracted on program request. 
The call 

addr = sbrk(incr); 

result caddr_t addr; int incr; 

changes the size of the data area by incr bytes and returns the new end of the data area, while 

addr = sstk(incr); 

result caddr_t addr; int incr; 

changes the size of the stack area. The stack area is also automatically extended as needed. 
On the VAX the text and data areas are adjacent in the P0 region, while the stack section is in
the P1 region, and grows downward.

1.2.2. Mapping pages 

The system supports sharing of data between processes by allowing pages to be mapped 
into memory. These mapped pages may be shared with other processes or private to the
process. Protection and sharing options are defined in <mman.h> as:

/* protections are chosen from these bits, or-ed together */
#define PROT_READ    0x4    /* pages can be read */
#define PROT_WRITE   0x2    /* pages can be written */
#define PROT_EXEC    0x1    /* pages can be executed */

/* sharing types; choose either SHARED or PRIVATE */
#define MAP_SHARED   1      /* share changes */
#define MAP_PRIVATE  2      /* changes are private */

The cpu-dependent size of a page is returned by the getpagesize system call: 

pagesize = getpagesize();
result int pagesize; 

The call: 

mmap(addr, len, prot, share, fd, pos); 

caddr_t addr; int len, prot, share, fd; off_t pos; 

causes the pages starting at addr and continuing for len bytes to be mapped from the object 
represented by descriptor fd, at absolute position pos. The parameter share specifies whether 
modifications made to this mapped copy of the page are to be kept private or are to be shared
with other references. The parameter prot specifies the accessibility of the mapped pages. The 
addr, len, and pos parameters must all be multiples of the pagesize. 

A process can move pages within its own memory by using the mremap call: 

mremap(addr, len, prot, share, fromaddr);
caddr_t addr; int len, prot, share; caddr_t fromaddr;

This call maps the pages starting at fromaddr to the address specified by addr. 

† This section represents the interface planned for later releases of the system. Of the calls described in this
section, only sbrk and getpagesize are included in 4.2BSD.

A mapping can be removed by the call

munmap(addr, len); 
caddr_t addr; int len; 

This causes further references to these pages to refer to private pages initialized to zero. 
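
Since the mapping calls in this section were only planned at the time of writing, the example below necessarily uses a different interface: the mmap and munmap of present-day POSIX systems, declared in <sys/mman.h>, whose argument lists and constant values differ from those shown here. It is a sketch only, and the function name and scratch file are assumptions. One page of a file is mapped with MAP_SHARED, a store is made through the mapping, and the change is then read back through the descriptor.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if a store through a shared mapping is visible in the file. */
int shared_map_demo(void)
{
    char path[] = "/tmp/mapXXXXXX";     /* scratch file template (assumption) */
    char back[4];
    long pagesize = sysconf(_SC_PAGESIZE);
    char *p;
    int fd, ok;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, pagesize) < 0) {  /* the object must cover the page */
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)pagesize, PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(p, "page", 4);                   /* store through the mapping */
    msync(p, (size_t)pagesize, MS_SYNC);    /* push the change to the file */
    ok = pread(fd, back, 4, 0) == 4 && memcmp(back, "page", 4) == 0;

    munmap(p, (size_t)pagesize);            /* remove the mapping */
    close(fd);
    return ok ? 0 : -1;
}
```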

1.2.3. Page protection control 

A process can control the protection of pages using the call 

mprotect(addr, len, prot); 
caddr_t addr; int len, prot; 

This call changes the specified pages to have protection prot. 

1.2.4. Giving and getting advice 

A process that has knowledge of its memory behavior may use the madvise call: 

madvise(addr, len, behav); 
caddr_t addr; int len, behav; 

Behav describes expected behavior, as given in <mman.h>: 




#define MADV_NORMAL      0   /* no further special treatment */
#define MADV_RANDOM      1   /* expect random page references */
#define MADV_SEQUENTIAL  2   /* expect sequential references */
#define MADV_WILLNEED    3   /* will need these pages */
#define MADV_DONTNEED    4   /* don’t need these pages */


Finally, a process may obtain information about whether pages are core resident by using the
call

mincore(addr, len, vec)
caddr_t addr; int len; result char *vec;

Here the current core residency of the pages is returned in the character array vec, with a value
of 1 meaning that the page is in-core.

1.3. Signals


1.3.1. Overview 

The system defines a set of signals that may be delivered to a process. Signal delivery 
resembles the occurrence of a hardware interrupt: the signal is blocked from further 
occurrence, the current process context is saved, and a new one is built. A process may specify 
the handler to which a signal is delivered, or specify that the signal is to be blocked or ignored. 
A process may also specify that a default action is to be taken when signals occur. 

Some signals will cause a process to exit when they are not caught. This may be
accompanied by creation of a core image file, containing the current memory image of the process for
use in post-mortem debugging. A process may choose to have signals delivered on a special 
stack, so that sophisticated software stack manipulations are possible. 

All signals have the same priority. If multiple signals are pending simultaneously, the 
order in which they are delivered to a process is implementation specific. Signal routines
execute with the signal that caused their invocation blocked, but other signals may yet occur.
Mechanisms are provided whereby critical sections of code may protect themselves against the 
occurrence of specified signals. 

1.3.2. Signal types 

The signals defined by the system fall into one of five classes: hardware conditions, 
software conditions, input/output notification, process control, or resource control. The set of 
signals is defined in the file <signal.h>. 

Hardware signals are derived from exceptional conditions which may occur during
execution. Such signals include SIGFPE representing floating point and other arithmetic
exceptions, SIGILL for illegal instruction execution, SIGSEGV for addresses outside the currently
assigned area of memory, and SIGBUS for accesses that violate memory protection
constraints. Other, more cpu-specific hardware signals exist, such as those for the various
customer-reserved instructions on the VAX (SIGIOT, SIGEMT, and SIGTRAP). 

Software signals reflect interrupts generated by user request: SIGINT for the normal 
interrupt signal; SIGQUIT for the more powerful quit signal, that normally causes a core image 
to be generated; SIGHUP and SIGTERM that cause graceful process termination, either 
because a user has “hung up”, or by user or program request; and SIGKILL, a more powerful 
termination signal which a process cannot catch or ignore. Other software signals (SIGALRM, 
SIGVTALRM, SIGPROF) indicate the expiration of interval timers. 

A process can request notification via a SIGIO signal when input or output is possible on 
a descriptor, or when a non-blocking operation completes. A process may request to receive a 
SIGURG signal when an urgent condition arises. 

A process may be stopped by a signal sent to it or the members of its process group. The 
SIGSTOP signal is a powerful stop signal, because it cannot be caught. Other stop signals 
SIGTSTP, SIGTTIN, and SIGTTOU are used when a user request, input request, or output 
request respectively is the reason the process is being stopped. A SIGCONT signal is sent to a 
process when it is continued from a stopped state. Processes may receive notification with a 
SIGCHLD signal when a child process changes state, either by stopping or by terminating. 

Exceeding resource limits may cause signals to be generated. SIGXCPU occurs when a 
process nears its CPU time limit and SIGXFSZ warns that the limit on file size creation has 
been reached. 

1.3.3. Signal handlers 

A process has a handler associated with each signal that controls the way the signal is 
delivered. The call

#include <signal.h>

struct sigvec {
        int     (*sv_handler)();
        int     sv_mask;
        int     sv_onstack;
};

sigvec(signo, sv, osv)
int signo; struct sigvec *sv; result struct sigvec *osv;

assigns interrupt handler address sv_handler to signal signo. Each handler address specifies
either an interrupt routine for the signal, that the signal is to be ignored, or that a default
action (usually process termination) is to occur if the signal occurs. The constants SIG_IGN
and SIG_DFL used as values for sv_handler cause ignoring or defaulting of a condition. The
sv_mask and sv_onstack values specify the signal mask to be used when the handler is
invoked and whether the handler should operate on the normal run-time stack or a special
signal stack (see below). If osv is non-zero, the previous signal vector is returned.

When a signal condition arises for a process, the signal is added to a set of signals
pending for the process. If the signal is not currently blocked by the process then it will be
delivered. The process of signal delivery adds the signal to be delivered and those signals
specified in the associated signal handler’s sv_mask to a set of those masked for the process,
saves the current process context, and places the process in the context of the signal handling
routine. The call is arranged so that if the signal handling routine exits normally the signal
mask will be restored and the process will resume execution in the original context. If the
process wishes to resume in a different context, then it must arrange to restore the signal mask
itself.

The mask of blocked signals is independent of handlers for signals. It prevents signals 
from being delivered much as a raised hardware interrupt priority level prevents hardware 
interrupts. Preventing an interrupt from occurring by changing the handler is analogous to 
disabling a device from further interrupts. 

The signal handling routine sv_handler is called by a C call of the form

(*sv_handler)(signo, code, scp); 

int signo; long code; struct sigcontext *scp; 

The signo parameter gives the number of the signal that occurred, and code is a word of information
supplied by the hardware. The scp parameter is a pointer to a machine-dependent structure
containing the information for restoring the context before the signal. 

1.3.4. Sending signals 

A process can send a signal to another process or group of processes with the calls: 

kill(pid, signo) 
int pid, signo; 

killpgrp(pgrp, signo) 
int pgrp, signo; 

Unless the process sending the signal is privileged, it and the process receiving the signal must 
have the same effective user id. 

Signals are also sent implicitly from a terminal device to the process group associated 
with the terminal when certain input characters are typed. 


1.3.5. Protecting critical sections


To block a section of code against one or more signals, a sigblock call may be used to add 
a set of signals to the existing mask, returning the old mask: 


oldmask = sigblock(mask); 
result long oldmask; long mask; 


The old mask can then be restored later with sigsetmask:


oldmask = sigsetmask(mask); 
result long oldmask; long mask; 

The sigblock call can be used to read the current mask by specifying an empty mask.

It is possible to check conditions with some signals blocked, and then to restore the mask
and pause waiting for a signal in a single call, by using:

sigpause(mask); 
long mask; 


1.3.6. Signal stacks 

Applications that maintain complex or fixed size stacks can use the call 


struct sigstack {
    caddr_t ss_sp;
    int     ss_onstack;
};

sigstack(ss, oss)
struct sigstack *ss; result struct sigstack *oss;

to provide the system with a stack based at ss_sp for delivery of signals. The value
ss_onstack indicates whether the process is currently on the signal stack, a notion maintained
in software by the system.

When a signal is to be delivered, the system checks whether the process is on a signal 
stack. If not, then the process is switched to the signal stack for delivery, with the return from 
the signal arranged to restore the previous stack. 

If the process wishes to take a non-local exit from the signal routine, or run code from 
the signal stack that uses a different stack, a sigstack call should be used to reset the signal 
stack. 

1.4. Timers


1.4.1. Real time 

The system’s notion of the current Greenwich time and the current time zone is set and
returned by the calls:

#include <sys/time.h>

settimeofday(tp, tzp);
struct timeval *tp;
struct timezone *tzp;

gettimeofday(tp, tzp); 
result struct timeval *tp; 
result struct timezone *tzp; 

where the structures are defined in <sys/time.h> as: 

struct timeval {
    long tv_sec;    /* seconds since Jan 1, 1970 */
    long tv_usec;   /* and microseconds */
};

struct timezone {
    int tz_minuteswest; /* of Greenwich */
    int tz_dsttime;     /* type of dst correction to apply */
};

Earlier versions of UNIX contained only a 1-second resolution version of this call, which
remains as a library routine:

time(tvsec)
result long *tvsec;

returning only the tv_sec field from the gettimeofday call.

1.4.2. Interval time

The system provides each process with three interval timers, defined in <sys/time.h>:

#define ITIMER_REAL     0   /* real time intervals */
#define ITIMER_VIRTUAL  1   /* virtual time intervals */
#define ITIMER_PROF     2   /* user and system virtual time */

The ITIMER_REAL timer decrements in real time. It could be used by a library routine to
maintain a wakeup service queue. A SIGALRM signal is delivered when this timer expires.

The ITIMER_VIRTUAL timer decrements in process virtual time. It runs only when
the process is executing. A SIGVTALRM signal is delivered when it expires.

The ITIMER_PROF timer decrements both in process virtual time and when the
system is running on behalf of the process. It is designed to be used by processes to statistically
profile their execution. A SIGPROF signal is delivered when it expires.

A timer value is defined by the itimerval structure:

struct itimerval {
    struct timeval it_interval; /* timer interval */
    struct timeval it_value;    /* current value */
};

and a timer is set or read by the call:

getitimer(which, value);
int which; result struct itimerval *value;

setitimer(which, value, ovalue);
int which; struct itimerval *value; result struct itimerval *ovalue;

The third argument to setitimer specifies an optional structure to receive the previous contents 
of the interval timer. A timer can be disabled by specifying a timer value of 0. 

The system rounds argument timer intervals to be not less than the resolution of its 
clock. This clock resolution can be determined by loading a very small value into a timer and 
reading the timer back to see what value resulted. 

The alarm system call of earlier versions of UNIX is provided as a library routine using
the ITIMER_REAL timer. The process profiling facilities of earlier versions of UNIX remain
because it is not always possible to guarantee the automatic restart of system calls after receipt
of a signal.

profil(buf, bufsize, offset, scale); 
result char *buf; int bufsize, offset, scale; 




1.5. Descriptors 


1.5.1. The reference table 

Each process has access to resources through descriptors. Each descriptor is a handle 
allowing the process to reference objects such as files, devices and communications links. 

Rather than allowing processes direct access to descriptors, the system introduces a level
of indirection, so that descriptors may be shared between processes. Each process has a
descriptor reference table, containing pointers to the actual descriptors. The descriptors
themselves thus have multiple references, and are reference counted by the system.

Each process has a fixed size descriptor reference table, where the size is returned by the 
getdtablesize call: 

nds = getdtablesize();
result int nds; 

and guaranteed to be at least 20. The entries in the descriptor reference table are referred to 
by small integers; for example if there are 20 slots they are numbered 0 to 19. 

1.5.2. Descriptor properties 

Each descriptor has a logical set of properties maintained by the system and defined by
its type. Each type supports a set of operations; some operations, such as reading and writing,
are common to several abstractions, while others are unique. The generic operations applying
to many of these types are described in section 2.1. Naming contexts, files and directories are
described in section 2.2. Section 2.3 describes communications domains and sockets.
Terminals and (structured and unstructured) devices are described in section 2.4.

1.5.3. Managing descriptor references 

A duplicate of a descriptor reference may be made by doing 

new = dup(old); 
result int new; int old; 

returning a copy of descriptor reference old indistinguishable from the original. The value of
new chosen by the system will be the smallest unused descriptor reference slot. A copy of a
descriptor reference may be made in a specific slot by doing

dup2(old, new); 
int old, new; 

The dup2 call causes the system to deallocate the descriptor reference currently occupying slot
new, if any, replacing it with a reference to the same descriptor as old. This deallocation is
also performed by:

close(old); 
int old; 
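
The three calls can be exercised together on a pipe. This is a sketch assuming a POSIX system; the function name dup_shares_object is hypothetical. It shows that a copy made with dup or dup2 refers to the same underlying object as the original reference, and that closing one reference leaves the others usable.

```c
#include <unistd.h>

/*
 * Returns 1 if a byte written through a dup'ed reference can be read
 * through a dup2'ed reference to the other end of the same pipe,
 * 0 if not, -1 on error.
 */
int dup_shares_object(void)
{
    int fds[2], wcopy, rslot;
    char c = 0;
    int ok;

    if (pipe(fds) < 0)
        return -1;

    wcopy = dup(fds[1]);        /* system picks the smallest unused slot */
    if (wcopy < 0)
        return -1;
    close(fds[1]);              /* original reference gone, copy remains */

    rslot = fds[1];             /* the slot just freed by close */
    if (dup2(fds[0], rslot) < 0)    /* copy the read end into that slot */
        return -1;
    close(fds[0]);

    ok = (write(wcopy, "x", 1) == 1 &&
          read(rslot, &c, 1) == 1 &&
          c == 'x');

    close(wcopy);
    close(rslot);
    return ok ? 1 : 0;
}
```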

1.5.4. Multiplexing requests 

The system provides a standard way to do synchronous and asynchronous multiplexing of 
operations. 

Synchronous multiplexing is performed by using the select call: 

nds = select(nd, in, out, except, tvp); 
result int nds; int nd; result *in, *out, *except; 
struct timeval *tvp; 

The select call examines the descriptors specified by the sets in, out and except, replacing the
specified bit masks by the subsets that select for input, output, and exceptional conditions
respectively (nd indicates the size, in bytes, of the bit masks). If any descriptors meet the
following criteria, then the number of such descriptors is returned in nds and the bit masks are
updated.

• A descriptor selects for input if an input oriented operation such as read or receive is
possible, or if a connection request may be accepted (see section 2.3.1.4).

• A descriptor selects for output if an output oriented operation such as write or send is 
possible, or if an operation that was “in progress”, such as connection establishment, has 
completed (see section 2.1.3). 

• A descriptor selects for an exceptional condition if a condition that would cause a 
SIGURG signal to be generated exists (see section 1.3.2). 

If none of the specified conditions is true, the operation blocks for at most the amount of time 
specified by tvp , or waits for one of the conditions to arise if tvp is given as 0. 
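
A minimal sketch of synchronous multiplexing on a single descriptor, assuming the POSIX form of select, in which the first argument is the highest descriptor number plus one and the sets are manipulated with the FD_ZERO, FD_SET and FD_ISSET macros. The function name readable_within is hypothetical.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Returns 1 if fd selects for input within ms milliseconds,
 * 0 if the timeout expires first, -1 on error.
 */
int readable_within(int fd, long ms)
{
    fd_set in;
    struct timeval tv;
    int n;

    FD_ZERO(&in);
    FD_SET(fd, &in);                /* the only descriptor of interest */
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;

    n = select(fd + 1, &in, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    /* On return the set holds only the descriptors that are ready. */
    return (n > 0 && FD_ISSET(fd, &in)) ? 1 : 0;
}
```

Passing a null pointer instead of &tv would make the call wait indefinitely, as the text describes for tvp given as 0.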

Options affecting i/o on a descriptor may be read and set by the call: 

dopt = fcntl(d, cmd, arg)
result int dopt; int d, cmd, arg;

/* interesting values for cmd */
#define F_SETFL     3   /* set descriptor options */
#define F_GETFL     4   /* get descriptor options */
#define F_SETOWN    5   /* set descriptor owner (pid/pgrp) */
#define F_GETOWN    6   /* get descriptor owner (pid/pgrp) */

The F_SETFL cmd may be used to set a descriptor in non-blocking i/o mode and/or enable
signalling when i/o is possible. F_SETOWN may be used to specify a process or process
group to be signalled when using the latter mode of operation.

Operations on non-blocking descriptors will either complete immediately, note an error
EWOULDBLOCK, partially complete an input or output operation returning a partial count,
or return an error EINPROGRESS noting that the requested operation is in progress. A
descriptor which has signalling enabled will cause the specified process and/or process group
to be signalled, with a SIGIO for input, output, or in-progress operation complete, or a
SIGURG for exceptional conditions.

For example, when writing to a terminal using non-blocking output, the system will
accept only as much data as there is buffer space for and return; when making a connection on
a socket, the operation may return indicating that the connection establishment is “in
progress”. The select facility can be used to determine when further output is possible on the
terminal, or when the connection establishment attempt is complete.


1.5.5. Descriptor wrapping†

A user process may build descriptors of a specified type by wrapping a communications
channel with a system supplied protocol translator:

new = wrap(old, proto)
result int new; int old; struct dprop *proto;

Operations on the descriptor old are then translated by the system provided protocol translator
into requests on the underlying object old in a way defined by the protocol. The protocols
supported by the kernel may vary from system to system and are described in the programmer’s
manual.

Protocols may be based on communications multiplexing or a rights-passing style of
handling multiple requests made on the same object. For instance, a protocol for implementing a
file abstraction may or may not include locally generated “read-ahead” requests. A protocol
that provides for read-ahead may provide higher performance but have a more difficult
implementation.

Another example is the terminal driving facilities. Normally a terminal is associated with
a communications line and the terminal type and standard terminal access protocol is wrapped
around a synchronous communications line and given to the user. If a virtual terminal is
required, the terminal driver can be wrapped around a communications link, the other end of
which is held by a virtual terminal protocol interpreter.

† The facilities described in this section are not included in 4.2BSD.

1.6. Resource controls


1.6.1. Process priorities 

The system gives CPU scheduling priority to processes that have not used CPU time 
recently. This tends to favor interactive processes and processes that execute only for short 
periods. It is possible to determine the priority currently assigned to a process, process group, 
or the processes of a specified user, or to alter this priority using the calls: 

#define PRIO_PROCESS    0   /* process */
#define PRIO_PGRP       1   /* process group */
#define PRIO_USER       2   /* user id */

prio = getpriority(which, who); 
result int prio; int which, who; 

setpriority(which, who, prio); 
int which, who, prio; 

The value prio is in the range -20 to 20. The default priority is 0; lower priorities cause more 
favorable execution. The getpriority call returns the highest priority (lowest numerical value) 
enjoyed by any of the specified processes. The setpriority call sets the priorities of all of the 
specified processes to the specified value. Only the super-user may lower priorities. 
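
Because -1 is a legitimate priority, a caller of getpriority cannot treat a -1 return as an error on its own. A sketch of one way to handle this, assuming a POSIX system where errno distinguishes the two cases and a who value of 0 denotes the calling process; the function name current_priority is hypothetical.

```c
#include <errno.h>
#include <sys/resource.h>
#include <sys/time.h>

/*
 * Store the calling process's scheduling priority in *prio.
 * Returns 0 on success, -1 on error.
 */
int current_priority(int *prio)
{
    int p;

    errno = 0;                  /* -1 is also a valid priority value */
    p = getpriority(PRIO_PROCESS, 0);
    if (p == -1 && errno != 0)
        return -1;
    *prio = p;
    return 0;
}
```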

1.6.2. Resource utilization 

The resources used by a process are returned by a getrusage call, returning information in 
a structure defined in <sys/resource.h>: 

#define RUSAGE_SELF     0   /* usage by this process */
#define RUSAGE_CHILDREN -1  /* usage by all children */

getrusage(who, rusage)
int who; result struct rusage *rusage;

struct rusage {
    struct timeval ru_utime;    /* user time used */
    struct timeval ru_stime;    /* system time used */
    int ru_maxrss;      /* maximum core resident set size: kbytes */
    int ru_ixrss;       /* integral shared memory size (kbytes*sec) */
    int ru_idrss;       /* unshared data " */
    int ru_isrss;       /* unshared stack " */
    int ru_minflt;      /* page-reclaims */
    int ru_majflt;      /* page faults */
    int ru_nswap;       /* swaps */
    int ru_inblock;     /* block input operations */
    int ru_oublock;     /* block output " */
    int ru_msgsnd;      /* messages sent */
    int ru_msgrcv;      /* messages received */
    int ru_nsignals;    /* signals received */
    int ru_nvcsw;       /* voluntary context switches */
    int ru_nivcsw;      /* involuntary " */
};

The who parameter specifies whose resource usage is to be returned. The resources used by 
the current process, or by all the terminated children of the current process may be requested. 
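
As an illustration, the user and system times in the rusage structure can be combined into a single CPU figure. This is a sketch assuming a POSIX system; the helper names tv_seconds and cpu_seconds are hypothetical.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a timeval to seconds as a double. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/*
 * Total CPU seconds (user plus system) charged to who, which is
 * RUSAGE_SELF or RUSAGE_CHILDREN.  Returns -1.0 on error.
 */
double cpu_seconds(int who)
{
    struct rusage ru;

    if (getrusage(who, &ru) < 0)
        return -1.0;
    return tv_seconds(ru.ru_utime) + tv_seconds(ru.ru_stime);
}
```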

1.6.3. Resource limits



The resources of a process for which limits are controlled by the kernel are defined in 
<sys/resource.h>, and controlled by the getrlimit and setrlimit calls: 


#define RLIMIT_CPU      0   /* cpu time in milliseconds */
#define RLIMIT_FSIZE    1   /* maximum file size */
#define RLIMIT_DATA     2   /* maximum data segment size */
#define RLIMIT_STACK    3   /* maximum stack segment size */
#define RLIMIT_CORE     4   /* maximum core file size */
#define RLIMIT_RSS      5   /* maximum resident set size */

#define RLIM_NLIMITS    6

#define RLIM_INFINITY   0x7fffffff

struct rlimit {
    int rlim_cur;   /* current (soft) limit */
    int rlim_max;   /* hard limit */
};

getrlimit(resource, rlp)
int resource; result struct rlimit *rlp;

setrlimit(resource, rlp)
int resource; struct rlimit *rlp;

Only the super-user can raise the maximum limits. Other users may only alter rlim_cur
within the range from 0 to rlim_max or (irreversibly) lower rlim_max.
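
The difference between the soft and hard limits can be sketched as follows, assuming a POSIX system, where the limit fields have type rlim_t rather than int. Only rlim_cur is changed, so the operation is reversible without privilege. The function name soft_limit_roundtrip is hypothetical.

```c
#include <sys/resource.h>
#include <sys/time.h>

/*
 * Set the soft limit for resource to soft, read it back, then restore
 * the original soft limit.  Returns 1 if the value read back matched,
 * 0 if not, -1 on error or if soft exceeds the hard limit.
 */
int soft_limit_roundtrip(int resource, rlim_t soft)
{
    struct rlimit orig, lim, check;
    int ok;

    if (getrlimit(resource, &orig) < 0)
        return -1;
    if (orig.rlim_max != RLIM_INFINITY && soft > orig.rlim_max)
        return -1;              /* would need privilege */

    lim = orig;
    lim.rlim_cur = soft;        /* rlim_max untouched, so reversible */
    if (setrlimit(resource, &lim) < 0)
        return -1;
    if (getrlimit(resource, &check) < 0)
        return -1;
    ok = (check.rlim_cur == soft);

    if (setrlimit(resource, &orig) < 0)     /* put the old soft limit back */
        return -1;
    return ok;
}
```

For example, soft_limit_roundtrip(RLIMIT_CORE, 0) temporarily disables core files for the calling process.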

1.7. System operation support


Unless noted otherwise, the calls in this section are permitted only to a privileged user. 

1.7.1. Bootstrap operations 
The call 

mount(blkdev, dir, ronly); 
char *blkdev, *dir; int ronly; 

extends the UNIX name space. The mount call specifies a block device blkdev containing a 
UNIX file system to be made available starting at dir . If ronly is set then the file system is 
read-only; writes to the file system will not be permitted and access times will not be updated 
when files are referenced. Dir is normally a name in the root directory. 

The call 

swapon(blkdev, size); 
char *blkdev; int size; 

specifies a device to be made available for paging and swapping. 

1.7.2. Shutdown operations 
The call 

unmount (dir); 
char *dir; 

unmounts the file system mounted on dir . This call will succeed only if the file system is not 
currently being used. 

The call 
sync(); 

schedules input/output to clean all system buffer caches. (This call does not require privileged
status.)

The call 

reboot(how) 
int how; 

causes a machine halt or reboot. The call may request a reboot by specifying how as
RB_AUTOBOOT, or that the machine be halted with RB_HALT. These constants are
defined in <sys/reboot.h>.

1.7.3. Accounting 

The system optionally keeps an accounting record in a file for each process that exits on 
the system. The format of this record is beyond the scope of this document. The accounting 
may be enabled to a file name by doing 

acct(path); 
char *path; 

If path is null, then accounting is disabled. Otherwise, the named file becomes the accounting 
file. 

2. System facilities

This section discusses the system facilities that are not considered part of the kernel. 

The system abstractions described are: 

Directory contexts 

A directory context is a position in the UNIX file system name space. Operations on files 
and other named objects in a file system are always specified relative to such a context. 

Files 

Files are used to store uninterpreted sequences of bytes on which random access reads and
writes may occur. Pages from files may also be mapped into process address space. A
directory may be read as a file.†

Communications domains 

A communications domain represents an interprocess communications environment, such
as the communications facilities of the UNIX system, communications in the INTERNET,
or the resource sharing protocols and access rights of a resource sharing system on
a local network.

Sockets 

A socket is an endpoint of communication and the focal point for IPC in a communications
domain. Sockets may be created in pairs, or given names and used to rendezvous
with other sockets in a communications domain, accepting connections from these sockets
or exchanging messages with them. These operations model a labeled or unlabeled
communications graph, and can be used in a wide variety of communications domains.
Sockets can have different types to provide different semantics of communication,
increasing the flexibility of the model.

Terminals and other devices 

Devices include terminals, providing input editing and interrupt generation and output 
flow control and editing, magnetic tapes, disks and other peripherals. They often support 
the generic read and write operations as well as a number of ioctls. 

Processes 

Process descriptors provide facilities for control and debugging of other processes. 


† Support for mapping files is not included in the 4.2 release.

2.1. Generic operations


Many system abstractions support the operations read, write and ioctl. We describe the
basics of these common primitives here. Similarly, the mechanisms whereby normally
synchronous operations may occur in a non-blocking or asynchronous fashion are common to all
system-defined abstractions and are described here.

2.1.1. Read and write 

The read and write system calls can be applied to communications channels, files,
terminals and devices. They have the form:

cc = read(fd, buf, nbytes);
result int cc; int fd; result caddr_t buf; int nbytes;

cc = write(fd, buf, nbytes);
result int cc; int fd; caddr_t buf; int nbytes;

The read call transfers as much data as possible from the object defined by fd to the buffer at 
address buf of size nbytes. The number of bytes transferred is returned in cc, which is -1 if a 
return occurred before any data was transferred because of an error or use of non-blocking 
operations. 

The write call transfers data from the buffer to the object defined by fd. Depending on 
the type of fd, it is possible that the write call will accept some portion of the provided bytes; 
the user should resubmit the other bytes in a later request in this case. Error returns because 
of interrupted or otherwise incomplete operations are possible. 
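
The advice to resubmit the remaining bytes after a partial write can be sketched as a small loop, assuming a POSIX system. The function name write_all is hypothetical, and retrying on EINTR is one possible policy for interrupted calls, not something the text prescribes.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write all len bytes of buf to fd, resubmitting the remainder after
 * each partial write.  Returns 0 on success, -1 on error (errno set).
 */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any transfer: retry */
            return -1;
        }
        p += n;                 /* skip the bytes that were accepted */
        len -= (size_t)n;
    }
    return 0;
}
```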

Scattering of data on input or gathering of data for output is also possible using an array 
of input/output vector descriptors. The type for the descriptors is defined in <sys/uio.h> as: 

struct iovec {
    caddr_t iov_msg;    /* base of a component */
    int     iov_len;    /* length of a component */
};

The calls using an array of descriptors are:

cc = readv(fd, iov, iovlen);
result int cc; int fd; struct iovec *iov; int iovlen;

cc = writev(fd, iov, iovlen);
result int cc; int fd; struct iovec *iov; int iovlen;

Here iovlen is the count of elements in the iov array.

2.1.2. Input/output control

Control operations on an object are performed by the ioctl operation:

ioctl(fd, request, buffer);
int fd, request; caddr_t buffer;

This operation causes the specified request to be performed on the object fd. The request
parameter specifies whether the argument buffer is to be read, written, read and written, or is
not needed, and also the size of the buffer, as well as the request. Different descriptor types
and subtypes within descriptor types may use distinct ioctl requests. For example, operations
on terminals control flushing of input and output queues and setting of terminal parameters;
operations on disks cause formatting operations to occur; operations on tapes control tape
positioning.

The names for basic control operations are defined in <sys/ioctl.h>.

2.1.3. Non-blocking and asynchronous operations 

A process that wishes to do non-blocking operations on one of its descriptors sets the
descriptor in non-blocking mode as described in section 1.5.4. Thereafter the read call will
return a specific EWOULDBLOCK error indication if there is no data to be read. The process
may select the associated descriptor to determine when a read is possible.

Output attempted when a descriptor can accept less than is requested will either accept
some of the provided data, returning a shorter than normal length, or return an error
indicating that the operation would block. More output can be performed as soon as a select call
indicates the object is writeable.

Operations other than data input or output may be performed on a descriptor in a non- 
blocking fashion. These operations will return with a characteristic error indicating that they 
are in progress if they cannot return immediately. The descriptor may then be selected for 
write to find out when the operation can be retried. When select indicates the descriptor is 
writeable, a respecification of the original operation will return the result of the operation. 
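
The non-blocking read behaviour described in this section can be sketched as follows, assuming a POSIX system where the non-blocking option is named O_NONBLOCK and where the "would block" error may be reported as either EWOULDBLOCK or EAGAIN. The function names set_nonblocking and try_read_byte are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd in non-blocking mode, preserving its other option flags. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Try to read one byte from fd.  Returns 1 if a byte was read,
 * 0 if the read would have blocked, -1 on end of file or other error.
 */
int try_read_byte(int fd, char *out)
{
    ssize_t n = read(fd, out, 1);

    if (n == 1)
        return 1;
    if (n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
        return 0;               /* nothing to read yet; select and retry */
    return -1;
}
```

A caller that receives 0 would normally wait with select until the descriptor selects for input before trying again.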

2.2. File system


2.2.1. Overview 

The file system abstraction provides access to a hierarchical file system structure. The 
file system contains directories (each of which may contain other sub-directories) as well as 
files and references to other objects such as devices and inter-process communications sockets. 

Each file is organized as a linear array of bytes. No record boundaries or system related
information is present in a file. Files may be read and written in a random-access fashion.
The user may read the data in a directory as though it were an ordinary file to determine the
names of the contained files, but only the system may write into the directories. The file
system stores only a small amount of ownership, protection and usage information with a file.

2.2.2. Naming 

The file system calls take path name arguments. These consist of zero or more component
file names separated by “/” characters, where each file name is up to 255 ASCII characters
excluding null and “/”.

Each process always has two naming contexts: one for the root directory of the file
system and one for the current working directory. These are used by the system in the filename
translation process. If a path name begins with a “/”, it is called a full path name and
interpreted relative to the root directory context. If the path name does not begin with a “/”, it is
called a relative path name and interpreted relative to the current directory context.

The system limits the total length of a path name to 1024 characters. 

The file name “..” in each directory refers to the parent directory of that directory. The
parent directory of a file system is always the system’s root directory.

The calls 

chdir(path); 
char *path; 

chroot(path) 
char *path; 

change the current working directory and root directory context of a process. Only the super- 
user can change the root directory context of a process. 

2.2.3. Creation and removal 

The file system allows directories, files, special devices, and “portals” to be created and 
removed from the file system. 

2.2.3.1. Directory creation and removal 

A directory is created with the mkdir system call: 

mkdir(path, mode); 
char *path; int mode; 

and removed with the rmdir system call: 

rmdir(path); 
char *path; 

A directory must be empty if it is to be deleted. 

2.2.3.2. File creation

Files are created with the open system call, 

fd = open(path, oflag, mode); 

result int fd; char *path; int oflag, mode; 

The path parameter specifies the name of the file to be created. The oflag parameter must
include O_CREAT from below to cause the file to be created. The protection for the new file
is specified in mode. Bits for oflag are defined in <sys/file.h>:

#define O_RDONLY    000     /* open for reading */
#define O_WRONLY    001     /* open for writing */
#define O_RDWR      002     /* open for read & write */
#define O_NDELAY    004     /* non-blocking open */
#define O_APPEND    010     /* append on each write */
#define O_CREAT     01000   /* open with file create */
#define O_TRUNC     02000   /* open with truncation */
#define O_EXCL      04000   /* error on create if file exists */


One of O_RDONLY, O_WRONLY and O_RDWR should be specified, indicating what
types of operations are desired to be performed on the open file. The operations will be
checked against the user’s access rights to the file before allowing the open to succeed.
Specifying O_APPEND causes writes to automatically append to the file. The flag O_CREAT
causes the file to be created if it does not exist, with the specified mode, owned by the current
user and the group of the containing directory.

If the open specifies to create the file with O_EXCL and the file already exists, then the
open will fail without affecting the file in any way. This provides a simple exclusive access
facility. 
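
The exclusive access facility can be sketched as a lock-file test, assuming a POSIX system where the flags are declared in <fcntl.h> and a failed exclusive create reports EEXIST. The function name create_exclusive is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to create path exclusively.  Returns 1 if this call created the
 * file, 0 if it already existed, -1 on any other error.
 */
int create_exclusive(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return errno == EEXIST ? 0 : -1;
    close(fd);                  /* the file's existence is the lock */
    return 1;
}
```

Of several processes calling create_exclusive on the same path, at most one sees a return of 1; it releases the lock by unlinking the file.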

2.2.3.3. Creating references to devices 

The file system allows entries which reference peripheral devices. Peripherals are
distinguished as block or character devices according to their ability to support block-oriented
operations. Devices are identified by their “major” and “minor” device numbers. The major
device number determines the kind of peripheral it is, while the minor device number indicates
one of possibly many peripherals of that kind. Structured devices have all operations
performed internally in “block” quantities while unstructured devices often have a number of
special ioctl operations, and may have input and output performed in large units. The mknod call
creates special entries:

mknod(path, mode, dev); 
char *path; int mode, dev;

where mode is formed from the object type and access permissions. The parameter dev is a 
configuration dependent parameter used to identify specific character or block i/o devices. 

2.2.3.4. Portal creation†

The call

fd = portal(name, server, param, dtype, protocol, domain, socktype)
result int fd; char *name, *server, *param; int dtype, protocol;
int domain, socktype;

places a name in the file system name space that causes connection to a server process when
the name is used. The portal call returns an active portal in fd as though an access had
occurred to activate an inactive portal, as now described.

When an inactive portal is accessed, the system sets up a socket of the specified socktype
in the specified communications domain (see section 2.3), and creates the server process, giving
it the specified param as argument to help it identify the portal, and also giving it the newly
created socket as descriptor number 0. The accessor of the portal will create a socket in the
same domain and connect to the server. The user will then wrap the socket in the specified
protocol to create an object of the required descriptor type dtype and proceed with the
operation which was in progress before the portal was encountered.

While the server process holds the socket (which it received as fd from the portal call on
descriptor 0 at activation) further references will result in connections being made to the same
socket.

† The portal call is not implemented in 4.2BSD.

2.2.3.5. File, device, and portal removal 

A reference to a file, special device or portal may be removed with the unlink call, 

unlink(path); 
char *path; 

The caller must have write access to the directory in which the file is located for this call to be 
successful. 

2.2.4. Reading and modifying file attributes 

Detailed information about the attributes of a file may be obtained with the calls: 

#include <sys/stat.h>

stat(path, stb);
char *path; result struct stat *stb;

fstat(fd, stb);
int fd; result struct stat *stb;

The stat structure includes the file type, protection, ownership, access times, size, and a count
of hard links. If the file is a symbolic link, then the status of the link itself (rather than the
file the link references) may be found using the lstat call:

lstat(path, stb); 

char *path; result struct stat *stb; 

A newly created file is assigned the user id of the process that created it and the group id
of the directory in which it was created. The ownership of a file may be changed by either of
the calls

chown(path, owner, group); 
char *path; int owner, group; 

fchown(fd, owner, group); 
int fd, owner, group; 

In addition to ownership, each file has three levels of access protection associated with it. 
These levels are owner relative, group relative, and global (all users and groups). Each level of 
access has separate indicators for read permission, write permission, and execute permission. 
The protection bits associated with a file may be set by either of the calls: 

chmod(path, mode);
char *path; int mode; 

fchmod(fd, mode); 
int fd, mode; 

where mode is a value indicating the new protection of the file. The file mode is a three digit 
octal number. Each digit encodes read access as 4, write access as 2 and execute access as 1, 
or’ed together. The 0700 bits describe owner access, the 070 bits describe the access rights for 
processes in the same group as the file, and the 07 bits describe the access rights for other 
processes. 
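As a small sketch of the octal encoding, the following composes a mode from one digit per class and applies it to a scratch file. It assumes a present-day POSIX system; make_mode and chmod_demo are hypothetical names chosen for illustration.

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Build a mode from one digit per class; each digit is the sum of
 * read (4), write (2) and execute (1). */
int make_mode(int owner, int group, int other)
{
    return (owner << 6) | (group << 3) | other;
}

/* Apply mode to a scratch file and return the permission bits that
 * fstat reports afterwards, or -1 on error. */
int chmod_demo(int mode)
{
    char file[] = "/tmp/chmoddemoXXXXXX";
    struct stat stb;
    int fd = mkstemp(file), result = -1;

    if (fd < 0)
        return -1;
    if (fchmod(fd, (mode_t)mode) == 0 && fstat(fd, &stb) == 0)
        result = (int)(stb.st_mode & 0777);
    close(fd);
    unlink(file);
    return result;
}
```

For instance, make_mode(6, 4, 0) yields 0640: read and write for the owner, read for the group, and nothing for other processes.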

Finally, the access and modify times on a file may be set by the call: 

utimes(path, tvp);
char *path; struct timeval *tvp[2];

This is particularly useful when moving files between media, to preserve relationships between 
the times the file was modified. 
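Preserving times across a copy might look like the sketch below. It uses the present-day POSIX form of utimes, which takes an array of two struct timeval values (access time first, then modify time); copy_times and utimes_demo are hypothetical helpers.

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>

/* Give file `to` the access and modify times of file `from`. */
int copy_times(const char *from, const char *to)
{
    struct stat stb;
    struct timeval tv[2];

    if (stat(from, &stb) < 0)
        return -1;
    tv[0].tv_sec = stb.st_atime;  tv[0].tv_usec = 0;    /* access time */
    tv[1].tv_sec = stb.st_mtime;  tv[1].tv_usec = 0;    /* modify time */
    return utimes(to, tv);
}

/* Hypothetical demo: backdate file a, copy its times to b, and verify. */
int utimes_demo(void)
{
    char a[] = "/tmp/utimesaXXXXXX", b[] = "/tmp/utimesbXXXXXX";
    struct timeval old[2] = { { 86400, 0 }, { 172800, 0 } };
    struct stat stb;
    int fa = mkstemp(a), fb = mkstemp(b), ok;

    ok = fa >= 0 && fb >= 0
      && utimes(a, old) == 0                /* pretend a is an old file */
      && copy_times(a, b) == 0
      && stat(b, &stb) == 0
      && stb.st_atime == 86400 && stb.st_mtime == 172800;
    if (fa >= 0) { close(fa); unlink(a); }
    if (fb >= 0) { close(fb); unlink(b); }
    return ok ? 0 : -1;
}
```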

2.2.5. Links and renaming 

Links allow multiple names for a file to exist. Links exist independently of the file linked 
to. 

Two types of links exist, hard links and symbolic links. A hard link is a reference counting
mechanism that allows a file to have multiple names within the same file system. Symbolic
links cause string substitution during the pathname interpretation process.

Hard links and symbolic links have different properties. A hard link ensures the target
file will always be accessible, even after its original directory entry is removed; no such
guarantee exists for a symbolic link. Symbolic links can span file system boundaries.

The following calls create a new link, named path2, to path1:

link(path1, path2);
char *path1, *path2;

symlink(path1, path2);
char *path1, *path2;

The unlink primitive may be used to remove either type of link. 

If a file is a symbolic link, the “value” of the link may be read with the readlink call,

len = readlink(path, buf, bufsize);
result int len; char *path; result char *buf; int bufsize;

This call returns, in buf , the null-terminated string substituted into pathnames passing through 
path. 
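The different properties of the two kinds of link can be observed with a sketch such as the following. It assumes present-day POSIX behavior, under which readlink does not append a terminating null, so the helper adds one itself; read_link_value and link_demo are hypothetical names.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Read the value of symbolic link `path` into buf as a C string.
 * Returns the length stored (the value is silently truncated if buf
 * is too small), or -1 on error. bufsize must be at least 1. */
int read_link_value(const char *path, char *buf, int bufsize)
{
    ssize_t len = readlink(path, buf, (size_t)bufsize - 1);

    if (len < 0)
        return -1;
    buf[len] = '\0';    /* POSIX readlink does not terminate the string */
    return (int)len;
}

/* Hypothetical demo: a file with a hard link b and a symbolic link s. */
int link_demo(void)
{
    char dir[] = "/tmp/linkdemoXXXXXX";
    char a[64], b[64], s[64], value[64];
    int fd, ok;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(a, sizeof a, "%s/a", dir);
    snprintf(b, sizeof b, "%s/b", dir);
    snprintf(s, sizeof s, "%s/s", dir);

    fd = open(a, O_WRONLY | O_CREAT, 0600);
    ok = fd >= 0;
    if (fd >= 0)
        close(fd);
    ok = ok && link(a, b) == 0          /* b: second name for the file */
            && symlink(a, s) == 0       /* s: string that names a */
            && unlink(a) == 0;          /* remove the original name */
    ok = ok && access(b, F_OK) == 0     /* file still reachable through b */
            && access(s, F_OK) != 0     /* but s now leads nowhere */
            && read_link_value(s, value, sizeof value) == (int)strlen(a)
            && strcmp(value, a) == 0;   /* the link value is unchanged */
    unlink(b);
    unlink(s);
    rmdir(dir);
    return ok ? 0 : -1;
}
```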

Atomic renaming of file system resident objects is possible with the rename call: 

rename(oldname, newname); 
char *oldname, *newname; 

where both oldname and newname must be in the same file system. If newname exists and is a 
directory, then it must be empty. 
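A common use of atomic renaming is to replace a file so that readers see either the old contents or the new, never a partial write. The sketch below assumes a present-day POSIX system; replace_file and rename_demo are hypothetical names, and the scratch name must lie in the same file system as the target.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write text to the scratch name tmp, then rename it over path. */
int replace_file(const char *path, const char *tmp, const char *text)
{
    size_t n = strlen(text);
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    if (write(fd, text, n) != (ssize_t)n) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0 || rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}

/* Hypothetical demo: install "old", replace it with "new", read it back. */
int rename_demo(void)
{
    char dir[] = "/tmp/renamedemoXXXXXX";
    char path[64], tmp[64], buf[8] = "";
    ssize_t n = -1;
    int fd, ok;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof path, "%s/config", dir);
    snprintf(tmp, sizeof tmp, "%s/config.new", dir);

    ok = replace_file(path, tmp, "old") == 0
      && replace_file(path, tmp, "new") == 0
      && access(tmp, F_OK) != 0;        /* the scratch name is gone */
    fd = open(path, O_RDONLY);
    if (fd >= 0) {
        n = read(fd, buf, sizeof buf - 1);
        close(fd);
    }
    ok = ok && n == 3 && strcmp(buf, "new") == 0;
    unlink(path);
    rmdir(dir);
    return ok ? 0 : -1;
}
```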

2.2.6. Extension and truncation 

Files are created with zero length and may be extended simply by writing or appending to
them. While a file is open the system maintains a pointer into the file indicating the current
location in the file associated with the descriptor. This pointer may be moved about in the file
in a random access fashion. To set the current offset into a file, the lseek call may be used,

oldoffset = lseek(fd, offset, type); 

result off_t oldoffset; int fd; off_t offset; int type; 

where type is given in <sys/file.h> as one of, 

#define L_SET   0   /* set absolute file offset */
#define L_INCR  1   /* set file offset relative to current position */
#define L_XTND  2   /* set offset relative to end-of-file */

The call “lseek(fd, 0, L_INCR)” returns the current offset into the file. 

Files may have “holes” in them. Holes are void areas in the linear extent of the file 
where data has never been written. These may be created by seeking to a location in a file 
past the current end-of-file and writing. Holes are treated by the system as zero valued bytes. 
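The creation of a hole can be sketched as follows. The example assumes a present-day POSIX system, where the constants are spelled SEEK_SET, SEEK_CUR and SEEK_END (same values as L_SET, L_INCR and L_XTND) and lseek returns the resulting offset; hole_demo is a hypothetical name.

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

/* Create a file whose first 4096 bytes are a hole, then read from
 * inside the hole. Returns 0 if the hole reads back as zero bytes. */
int hole_demo(void)
{
    char file[] = "/tmp/holedemoXXXXXX";
    char buf[4] = { 1, 1, 1, 1 };
    int fd = mkstemp(file), ok;

    if (fd < 0)
        return -1;
    ok = lseek(fd, 4096, SEEK_SET) == 4096  /* seek past end-of-file */
      && write(fd, "x", 1) == 1             /* writing creates the hole */
      && lseek(fd, 0, SEEK_CUR) == 4097     /* current offset */
      && lseek(fd, 0, SEEK_END) == 4097     /* offset of end-of-file */
      && lseek(fd, 100, SEEK_SET) == 100    /* back into the hole */
      && read(fd, buf, sizeof buf) == 4
      && buf[0] == 0 && buf[1] == 0 && buf[2] == 0 && buf[3] == 0;
    close(fd);
    unlink(file);
    return ok ? 0 : -1;
}
```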

A file may be truncated with either of the calls: 

truncate(path, length); 
char *path; int length; 


ftruncate(fd, length); 
int fd, length; 

reducing the size of the specified file to length bytes. 
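A minimal sketch of truncation by descriptor, assuming a present-day POSIX system (where length has type off_t rather than int); truncate_demo is a hypothetical name.

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Write ten bytes, cut the file back to four, and return the size
 * that fstat then reports, or -1 on error. */
long truncate_demo(void)
{
    char file[] = "/tmp/truncdemoXXXXXX";
    struct stat stb;
    long size = -1;
    int fd = mkstemp(file);

    if (fd < 0)
        return -1;
    if (write(fd, "0123456789", 10) == 10
        && ftruncate(fd, 4) == 0
        && fstat(fd, &stb) == 0)
        size = (long)stb.st_size;
    close(fd);
    unlink(file);
    return size;
}
```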


2.2.7. Checking accessibility 

A process running with different real and effective user ids may interrogate the
accessibility of a file to the real user by using the access call:


accessible = access(path, how); 
result int accessible; char *path; int how; 

Here how is constructed by or’ing the following bits, defined in <sys/file.h>: 


#define F_OK  0   /* file exists */
#define X_OK  1   /* file is executable */
#define W_OK  2   /* file is writable */
#define R_OK  4   /* file is readable */


The presence or absence of advisory locks does not affect the result of access. 
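The way the how bits combine can be sketched as follows, assuming a present-day POSIX system (where these constants are declared in <unistd.h>); can_read_write and access_demo are hypothetical names.

```c
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 if the real user may both read and write path, else 0. */
int can_read_write(const char *path)
{
    return access(path, R_OK | W_OK) == 0;
}

/* Hypothetical demo on a scratch file created with mode 0600. */
int access_demo(void)
{
    char file[] = "/tmp/accessdemoXXXXXX";
    int fd = mkstemp(file), ok;

    if (fd < 0)
        return -1;
    close(fd);
    ok = can_read_write(file) == 1          /* our own readable, writable file */
      && access(file, F_OK) == 0;           /* existence check only */
    unlink(file);
    ok = ok && access(file, F_OK) != 0;     /* no longer exists */
    return ok ? 0 : -1;
}
```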


2.2.8. Locking 

The file system provides basic facilities that allow cooperating processes to synchronize 
their access to shared files. A process may place an advisory read or write lock on a file, so 
that other cooperating processes may avoid interfering with the process’ access. This simple 
mechanism provides locking with file granularity. More granular locking can be built using the 
IPC facilities to provide a lock manager. The system does not force processes to obey the 
locks; they are of an advisory nature only. 

Locking is performed after an open call by applying the flock primitive, 

flock(fd, how); 
int fd, how; 

where the how parameter is formed from bits defined in <sys/file.h>: 


#define LOCK_SH  1   /* shared lock */
#define LOCK_EX  2   /* exclusive lock */
#define LOCK_NB  4   /* don’t block when locking */
#define LOCK_UN  8   /* unlock */

Successive lock calls may be used to increase or decrease the level of locking. If an object is
currently locked by another process when a flock call is made, the caller will be blocked until
the current lock owner releases the lock; this may be avoided by including LOCK_NB in the
how parameter. Specifying LOCK_UN removes all locks associated with the descriptor.

Advisory locks held by a process are automatically deleted when the process terminates. 
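The interaction of shared, exclusive and non-blocking requests can be sketched within a single process by opening the same file twice, since each open yields an independent lock holder. This assumes a present-day system that provides flock in <sys/file.h>; flock_demo is a hypothetical name.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/file.h>

/* Returns 0 if the locks behave as described in the text. */
int flock_demo(void)
{
    char file[] = "/tmp/flockdemoXXXXXX";
    int fd1 = mkstemp(file), fd2, ok;

    if (fd1 < 0)
        return -1;
    fd2 = open(file, O_RDWR);               /* independent open of the same file */
    ok = fd2 >= 0
      && flock(fd1, LOCK_EX) == 0           /* take an exclusive lock */
      && flock(fd2, LOCK_EX | LOCK_NB) < 0  /* refused rather than blocked */
      && errno == EWOULDBLOCK
      && flock(fd1, LOCK_SH) == 0           /* decrease to a shared lock */
      && flock(fd2, LOCK_SH | LOCK_NB) == 0 /* shared locks coexist */
      && flock(fd1, LOCK_UN) == 0
      && flock(fd2, LOCK_UN) == 0;
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(file);
    return ok ? 0 : -1;
}
```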

2.2.9. Disk quotas 

As an optional facility, each file system may be requested to impose limits on a user’s
disk usage. Two quantities are limited: the total amount of disk space which a user may
allocate in a file system and the total number of files a user may create in a file system. Quotas
are expressed as hard limits and soft limits. A hard limit is always imposed; if a user would
exceed a hard limit, the operation which caused the resource request will fail. A soft limit
results in the user receiving a warning message, but with allocation succeeding. Facilities are
provided to turn soft limits into hard limits if a user has exceeded a soft limit for an
unreasonable period of time.

To enable disk quotas on a file system the setquota call is used: 

setquota(special, file) 
char *special, *file; 

where special refers to a structured device file where a mounted file system exists, and file 
refers to a disk quota file (residing on the file system associated with special ) from which user 
quotas should be obtained. The format of the disk quota file is implementation dependent. 

To manipulate disk quotas the quota call is provided: 

#include <sys/quota.h>

quota(cmd, uid, arg, addr) 
int cmd, uid, arg; caddr_t addr; 

The indicated cmd is applied to the user ID uid. The parameters arg and addr are command 
specific. The file <sys/quota.h> contains definitions pertinent to the use of this call. 

2.3. Interprocess communications


2.3.1. Interprocess communication primitives 

2.3.1.1. Communication domains 

The system provides access to an extensible set of communication domains. A communication
domain is identified by a manifest constant defined in the file <sys/socket.h>. Important
standard domains supported by the system are the “unix” domain, AF_UNIX, for communication
within the system, and the “internet” domain for communication in the DARPA
internet, AF_INET. Other domains can be added to the system.


2.3.1.2. Socket types and protocols 

Within a domain, communication takes place between communication endpoints known 
as sockets. Each socket has the potential to exchange information with other sockets within 
the domain. 


Each socket has an associated abstract type, which describes the semantics of communication
using that socket. Properties such as reliability, ordering, and prevention of duplication
of messages are determined by the type. The basic set of socket types is defined in
<sys/socket.h>:

/* Standard socket types */
#define SOCK_DGRAM      1   /* datagram */
#define SOCK_STREAM     2   /* virtual circuit */
#define SOCK_RAW        3   /* raw socket */
#define SOCK_RDM        4   /* reliably-delivered message */
#define SOCK_SEQPACKET  5   /* sequenced packets */


The SOCK_DGRAM type models the semantics of datagrams in network communication:
messages may be lost or duplicated and may arrive out-of-order. The SOCK_RDM type
models the semantics of reliable datagrams: messages arrive unduplicated and in-order, the
sender is notified if messages are lost. The send and receive operations (described below)
generate reliable/unreliable datagrams. The SOCK_STREAM type models connection-based
virtual circuits: two-way byte streams with no record boundaries. The SOCK_SEQPACKET
type models a connection-based, full-duplex, reliable, sequenced packet exchange; the sender is
notified if messages are lost, and messages are never duplicated or presented out-of-order.
Users of the last two abstractions may use the facilities for out-of-band transmission to send
out-of-band data.

SOCK_RAW is used for unprocessed access to internal network layers and interfaces; it
has no specific semantics.


Other socket types can be defined.†

Each socket may have a concrete protocol associated with it. This protocol is used within
the domain to provide the semantics required by the socket type. For example, within the
“internet” domain, the SOCK_DGRAM type may be implemented by the UDP user datagram
protocol, and the SOCK_STREAM type may be implemented by the TCP transmission control
protocol, while no standard protocols to provide SOCK_RDM or SOCK_SEQPACKET
sockets exist.


2.3.1.3. Socket creation, naming and service establishment 

Sockets may be connected or unconnected. An unconnected socket descriptor is obtained
by the socket call:

s = socket(domain, type, protocol);
result int s; int domain, type, protocol;

† 4.2BSD does not support the SOCK_RDM and SOCK_SEQPACKET types.

An unconnected socket descriptor may yield a connected socket descriptor in one of two 
ways: either by actively connecting to another socket, or by becoming associated with a name 
in the communications domain and accepting a connection from another socket. 

To accept connections, a socket must first have a binding to a name within the
communications domain. Such a binding is established by a bind call:

bind(s, name, namelen); 

int s; char *name; int namelen; 

A socket’s bound name may be retrieved with a getsockname call: 

getsockname(s, name, namelen); 

int s; result caddr_t name; result int *namelen; 

while the peer’s name can be retrieved with getpeername: 

getpeername(s, name, namelen); 

int s; result caddr_t name; result int *namelen; 

Domains may support sockets with several names. 

2.3.1.4. Accepting connections 

Once a binding is made, it is possible to listen for connections: 

listen(s, backlog); 
int s, backlog; 

The backlog specifies the maximum count of connections that can be simultaneously queued 
awaiting acceptance. 

An accept call: 

t = accept(s, name, anamelen); 

result int t; int s; result caddr_t name; result int *anamelen; 

returns a descriptor for a new, connected, socket from the queue of pending connections on s. 

2.3.1.5. Making connections 

An active connection to a named socket is made by the connect call: 

connect(s, name, namelen);

int s; caddr_t name; int namelen; 
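The passive and active sides of service establishment can be sketched together in one process, using a stream socket in the “unix” domain. This is an illustration under present-day POSIX declarations (struct sockaddr_un from <sys/un.h>, which the text above does not describe); the function name rendezvous_demo and the scratch socket path are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Bind, listen, connect, accept, then pass "ping" from the active
 * socket to the accepted one. Returns the number of bytes received
 * into reply, or -1 on error. */
int rendezvous_demo(char *reply, int replysize)
{
    char dir[] = "/tmp/sockdemoXXXXXX";
    struct sockaddr_un name;
    int s = -1, c = -1, t = -1, n = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    memset(&name, 0, sizeof name);
    name.sun_family = AF_UNIX;
    snprintf(name.sun_path, sizeof name.sun_path, "%s/sock", dir);

    s = socket(AF_UNIX, SOCK_STREAM, 0);    /* will accept connections */
    c = socket(AF_UNIX, SOCK_STREAM, 0);    /* will actively connect */
    if (s >= 0 && c >= 0
        && bind(s, (struct sockaddr *)&name, sizeof name) == 0
        && listen(s, 5) == 0                /* queue up to 5 connections */
        && connect(c, (struct sockaddr *)&name, sizeof name) == 0
        && (t = accept(s, NULL, NULL)) >= 0 /* new, connected descriptor */
        && send(c, "ping", 4, 0) == 4)
        n = (int)recv(t, reply, (size_t)replysize, 0);

    if (t >= 0) close(t);
    if (c >= 0) close(c);
    if (s >= 0) close(s);
    unlink(name.sun_path);
    rmdir(dir);
    return n;
}
```

The connect call can complete before accept is called because the pending connection waits in the queue whose length was set by listen.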

It is also possible to create connected pairs of sockets without using the domain’s name
space to rendezvous; this is done with the socketpair call†:

socketpair(d, type, protocol, sv); 
int d, type, protocol; result int sv[2]; 

Here the returned sv descriptors correspond to those obtained with accept and connect. 

The call 

pipe(pv) 
result int pv[2]; 

creates a pair of SOCK_STREAM sockets in the UNIX domain, with pv[0] only readable
and pv[1] only writable.

† 4.2BSD supports socketpair creation only in the “unix” communication domain.
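The contrast between a socket pair, which carries data in both directions, and a pipe, which carries data one way, can be sketched as follows. The example assumes a present-day POSIX system, on which the read end of a pipe is pv[0] and the write end is pv[1]; socketpair_demo and pipe_demo are hypothetical names.

```c
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Returns 0 if a byte travels in both directions across a socket pair. */
int socketpair_demo(void)
{
    int sv[2], ok;
    char c = 0, d = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    ok = write(sv[0], "a", 1) == 1 && read(sv[1], &c, 1) == 1 && c == 'a'
      && write(sv[1], "b", 1) == 1 && read(sv[0], &d, 1) == 1 && d == 'b';
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}

/* Returns 0 if a byte written on pv[1] can be read from pv[0]. */
int pipe_demo(void)
{
    int pv[2], ok;
    char c = 0;

    if (pipe(pv) < 0)
        return -1;
    ok = write(pv[1], "x", 1) == 1 && read(pv[0], &c, 1) == 1 && c == 'x';
    close(pv[0]);
    close(pv[1]);
    return ok ? 0 : -1;
}
```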

2.3.1.6. Sending and receiving data 

Messages may be sent from a socket by:

cc = sendto(s, buf, len, flags, to, tolen);

result int cc; int s; caddr_t buf; int len, flags; caddr_t to; int tolen; 

if the socket is not connected or: 

cc = send(s, buf, len, flags); 

result int cc; int s; caddr_t buf; int len, flags; 

if the socket is connected. The corresponding receive primitives are: 

msglen = recvfrom(s, buf, len, flags, from, fromlenaddr); 

result int msglen; int s; result caddr_t buf; int len, flags; 

result caddr_t from; result int * fromlenaddr; 

and 

msglen = recv(s, buf, len, flags); 

result int msglen; int s; result caddr_t buf; int len, flags; 

In the unconnected case, the parameters to and tolen specify the destination of
the message, while the from parameter stores the source of the message, and *fromlenaddr
initially gives the size of the from buffer and is updated to reflect the true length of the from
address.

All calls cause the message to be received in or sent from the message buffer of length len 
bytes, starting at address buf. The flags specify peeking at a message without reading it or 
sending or receiving high-priority out-of-band messages, as follows: 

#define MSG_PEEK  0x1   /* peek at incoming message */
#define MSG_OOB   0x2   /* process out-of-band data */
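Peeking can be sketched with a pair of connected datagram sockets: the first receive inspects the message without consuming it, and the second receive returns the same message. This assumes a present-day POSIX system; peek_demo is a hypothetical name.

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Returns 0 if a peeked message is still delivered by the next recv. */
int peek_demo(void)
{
    int sv[2], ok;
    char first[8] = "", second[8] = "";

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    ok = send(sv[0], "hello", 5, 0) == 5
      && recv(sv[1], first, sizeof first, MSG_PEEK) == 5    /* stays queued */
      && recv(sv[1], second, sizeof second, 0) == 5         /* now consumed */
      && memcmp(first, "hello", 5) == 0
      && memcmp(second, "hello", 5) == 0;
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```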

2.3.1.7. Scatter/gather and exchanging access rights 

It is possible to scatter and gather data and to exchange access rights with messages. When
either of these operations is involved, the number of parameters to the call becomes large. 
Thus the system defines a message header structure, in <sys/socket.h>, which can be used to 
conveniently contain the parameters to the calls: 

struct msghdr {
        caddr_t msg_name;               /* optional address */
        int     msg_namelen;            /* size of address */
        struct  iov *msg_iov;           /* scatter/gather array */
        int     msg_iovlen;             /* # elements in msg_iov */
        caddr_t msg_accrights;          /* access rights sent/received */
        int     msg_accrightslen;       /* size of msg_accrights */
};


Here msg_name and msg_namelen specify the source or destination address if the socket is
unconnected; msg_name may be given as a null pointer if no names are desired or required.
The msg_iov and msg_iovlen describe the scatter/gather locations, as described in section
2.1.3. Access rights to be sent along with the message are specified in msg_accrights, which
has length msg_accrightslen. In the “unix” domain these are an array of integer descriptors,
taken from the sending process and duplicated in the receiver.

This structure is used in the operations sendmsg and recvmsg: 

sendmsg(s, msg, flags);

int s; struct msghdr *msg; int flags; 

msglen = recvmsg(s, msg, flags); 

result int msglen; int s; result struct msghdr *msg; int flags; 
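Gathering on send and scattering on receive can be sketched as follows. The example assumes a present-day POSIX system, where the scatter/gather element is struct iovec (from <sys/uio.h>) and the access rights fields have been superseded by msg_control; no access rights are passed here. The name gather_scatter_demo is hypothetical.

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one message gathered from two buffers, then receive it
 * scattered across two other buffers. Returns 0 on success. */
int gather_scatter_demo(void)
{
    int sv[2], ok;
    char h[] = "HEAD", p[] = "payload";
    char head[4], body[16];
    struct iovec out[2], in[2];
    struct msghdr msg;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    /* Gather: a single 11-byte message built from two pieces. */
    out[0].iov_base = h;  out[0].iov_len = 4;
    out[1].iov_base = p;  out[1].iov_len = 7;
    memset(&msg, 0, sizeof msg);        /* connected: no address needed */
    msg.msg_iov = out;
    msg.msg_iovlen = 2;
    ok = sendmsg(sv[0], &msg, 0) == 11;

    /* Scatter: the first 4 bytes land in head, the rest in body. */
    in[0].iov_base = head;  in[0].iov_len = sizeof head;
    in[1].iov_base = body;  in[1].iov_len = sizeof body;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = in;
    msg.msg_iovlen = 2;
    ok = ok && recvmsg(sv[1], &msg, 0) == 11
            && memcmp(head, "HEAD", 4) == 0
            && memcmp(body, "payload", 7) == 0;

    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```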

2.3.1.8. Using read and write with sockets 

The normal UNIX read and write calls may be applied to connected sockets; they are
translated into receive and send calls to or from a single area of memory, and any access
rights received are discarded. A process may operate on a virtual circuit socket, a terminal or
a file with blocking or non-blocking input/output operations without distinguishing the
descriptor type.

2.3.1.9. Shutting down halves of full-duplex connections 

A process that has a full-duplex socket such as a virtual circuit and no longer wishes to 
read from or write to this socket can give the call: 

shutdown(s, direction);
int s, direction;

where direction is 0 to not read further, 1 to not write further, or 2 to completely shut the
connection down.
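Shutting down only the writing half can be sketched as follows: the peer receives the data already sent, then sees end-of-file, while the opposite direction remains usable. This assumes a present-day POSIX system; shutdown_demo is a hypothetical name.

```c
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Returns 0 if a half-closed connection behaves as described. */
int shutdown_demo(void)
{
    int sv[2], ok;
    char buf[8];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    ok = send(sv[0], "bye", 3, 0) == 3
      && shutdown(sv[0], 1) == 0                /* 1: no further writes */
      && recv(sv[1], buf, sizeof buf, 0) == 3   /* queued data still arrives */
      && recv(sv[1], buf, sizeof buf, 0) == 0   /* then end-of-file */
      && send(sv[1], "ok", 2, 0) == 2           /* other direction still open */
      && recv(sv[0], buf, sizeof buf, 0) == 2;
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```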

2.3.1.10. Socket and protocol options 

Sockets, and their underlying communication protocols, may support options. These
options may be used to manipulate implementation specific or non-standard facilities. The
getsockopt and setsockopt calls are used to control options:

getsockopt(s, level, optname, optval, optlen);
int s, level, optname; result caddr_t optval; result int *optlen;

setsockopt(s, level, optname, optval, optlen);
int s, level, optname; caddr_t optval; int optlen;

The option optname is interpreted at the indicated protocol level for socket s. If a value is
specified with optval and optlen, it is interpreted by the software operating at the specified
level. The level SOL_SOCKET is reserved to indicate options maintained by the socket
facilities. Other level values indicate a particular protocol which is to act on the option request;
these values are normally interpreted as a “protocol number”.
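Reading and setting options at the SOL_SOCKET level can be sketched as follows. The option names SO_TYPE and SO_REUSEADDR are taken from present-day POSIX systems and are not listed in the text above; sockopt_demo is a hypothetical name.

```c
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Returns 0 if the socket reports its own type and accepts an option. */
int sockopt_demo(void)
{
    int s = socket(AF_UNIX, SOCK_DGRAM, 0);
    int type = 0, on = 1, reuse = 0, ok;
    socklen_t len;

    if (s < 0)
        return -1;
    len = sizeof type;                  /* in: buffer size, out: value size */
    ok = getsockopt(s, SOL_SOCKET, SO_TYPE, &type, &len) == 0
      && type == SOCK_DGRAM;
    len = sizeof reuse;
    ok = ok
      && setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on) == 0
      && getsockopt(s, SOL_SOCKET, SO_REUSEADDR, &reuse, &len) == 0
      && reuse != 0;
    close(s);
    return ok ? 0 : -1;
}
```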

2.3.2. UNIX domain 

This section describes briefly the properties of the UNIX communications domain. 

2.3.2.1. Types of sockets 

In the UNIX domain, the SOCK_STREAM abstraction provides pipe-like facilities,
while SOCK_DGRAM provides (usually) reliable message-style communications.

2.3.2.2. Naming 

Socket names are strings and may appear in the UNIX file system name space through
portals.†

† The 4.2BSD implementation of the UNIX domain embeds bound sockets in the UNIX file system name
space; this is a side effect of the implementation.

2.3.2.3. Access rights transmission

The ability to pass UNIX descriptors with messages in this domain allows migration of 
service within the system and allows user processes to be used in building system facilities. 

2.3.3. INTERNET domain 

This section describes briefly how the INTERNET domain is mapped to the model
described in this section. More information will be found in the document describing the
network implementation in 4.2BSD.

2.3.3.1. Socket types and protocols 

SOCK_STREAM is supported by the INTERNET TCP protocol; SOCK_DGRAM by
the UDP protocol. The SOCK_SEQPACKET type has no direct INTERNET family analogue; a
protocol based on one from the XEROX NS family and layered on top of IP could be
implemented to fill this gap.

2.3.3.2. Socket naming 

Sockets in the INTERNET domain have names composed of a 32 bit internet address
and a 16 bit port number. Options may be used to provide source routing for the address,
security options, or additional addresses for subnets of INTERNET for which the basic 32 bit
addresses are insufficient.
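The composition of such a name can be sketched with struct sockaddr_in, the present-day POSIX representation (from <netinet/in.h>), which this section does not itself describe. The helper name make_inet_name is hypothetical.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Fill *name from a dotted-decimal address and a port number.
 * Returns 0 on success, -1 if the address text is not valid. */
int make_inet_name(struct sockaddr_in *name, const char *dotted,
                   unsigned short port)
{
    memset(name, 0, sizeof *name);
    name->sin_family = AF_INET;
    name->sin_port = htons(port);       /* 16 bit port, network byte order */
    /* 32 bit address, stored in network byte order */
    return inet_pton(AF_INET, dotted, &name->sin_addr) == 1 ? 0 : -1;
}
```

A name built this way would be passed to bind or connect, cast to the generic address type, together with its size.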

2.3.3.3. Access rights transmission 

No access rights transmission facilities are provided in the INTERNET domain. 

2.3.3.4. Raw access 

The INTERNET domain allows the super-user access to the raw facilities of the various
network interfaces and the various internal layers of the protocol implementation. This allows
administrative and debugging functions to occur. These interfaces are modeled as
SOCK_RAW sockets.

2.4. Terminals and Devices


2.4.1. Terminals 

Terminals support read and write i/o operations, as well as a collection of
terminal-specific ioctl operations to control input character editing and output delays.

2.4.1.1. Terminal input 

Terminals are handled according to the underlying communication characteristics such as 
baud rate and required delays, and a set of software parameters. 

2.4.1.1.1. Input modes 

A terminal is in one of three possible modes: raw, cbreak, or cooked. In raw mode all
input is passed through to the reading process immediately and without interpretation. In 
cbreak mode, the handler interprets input only by looking for characters that cause interrupts 
or output flow control; all other characters are made available as in raw mode. In cooked 
mode, input is processed to provide standard line-oriented local editing functions, and input is 
presented on a line-by-line basis. 
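On present-day POSIX systems the closest analogue of cbreak mode is expressed through the termios interface, which is not the interface this document describes. The sketch below is offered only under that assumption: clearing ICANON turns off line-oriented editing, while ISIG keeps the interrupt characters active. The names make_cbreak and set_cbreak are hypothetical.

```c
#include <termios.h>
#include <unistd.h>

/* Edit *t so that it describes a cbreak-like mode: input is delivered
 * a character at a time, but interrupt characters are still honored. */
void make_cbreak(struct termios *t)
{
    t->c_lflag &= ~(tcflag_t)ICANON;    /* no line-by-line editing */
    t->c_lflag |= ISIG;                 /* keep interrupt characters */
    t->c_cc[VMIN] = 1;                  /* read returns after one character */
    t->c_cc[VTIME] = 0;                 /* with no inter-character timer */
}

/* Put the terminal open on fd into cbreak-like mode, saving the old
 * settings in *saved. Returns -1 if fd is not a terminal. */
int set_cbreak(int fd, struct termios *saved)
{
    struct termios t;

    if (tcgetattr(fd, &t) < 0)
        return -1;
    *saved = t;
    make_cbreak(&t);
    return tcsetattr(fd, TCSANOW, &t);
}
```

A caller would restore the cooked state later by passing the saved structure back to tcsetattr.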

2.4.1.1.2. Interrupt characters 

Interrupt characters are interpreted by the terminal handler only in cbreak and cooked
modes, and cause a software interrupt to be sent to all processes in the process group
associated with the terminal. Interrupt characters exist to send SIGINT and SIGQUIT signals, and
to stop a process group with the SIGTSTP signal either immediately, or when all input up to
the stop character has been read.

2.4.1.1.3. Line editing 

When the terminal is in cooked mode, editing of an input line is performed. Editing 
facilities allow deletion of the previous character or word, or deletion of the current input line. 
In addition, a special character may be used to reprint the current input line after some 
number of editing operations have been applied. 

Certain other characters are interpreted specially when a process is in cooked mode. The
end of line character determines the end of an input record. The end of file character
simulates an end of file occurrence on terminal input. Flow control is provided by stop output and
start output control characters. Output may be flushed with the flush output character; and a
literal character may be used to force literal input of the immediately following character in the
input line.

2.4.1.2. Terminal output 

On output, the terminal handler provides some simple formatting services. These include
converting the carriage return character to the two character return-linefeed sequence,
displaying non-graphic ASCII characters as “^character”, inserting delays after certain standard
control characters, expanding tabs, and providing translations for upper-case only terminals.

2.4.1.3. Terminal control operations 

When a terminal is first opened it is initialized to a standard state and configured with a 
set of standard control, editing, and interrupt characters. A process may alter this 
configuration with certain control operations, specifying parameters in a standard structure: 


struct ttymode {
        short   tt_ispeed;      /* input speed */
        int     tt_iflags;      /* input flags */
        short   tt_ospeed;      /* output speed */
        int     tt_oflags;      /* output flags */
};


and “special characters” are specified with the ttychars structure, 


struct ttychars {
        char    tc_erasec;      /* erase char */
        char    tc_killc;       /* erase line */
        char    tc_intrc;       /* interrupt */
        char    tc_quitc;       /* quit */
        char    tc_startc;      /* start output */
        char    tc_stopc;       /* stop output */
        char    tc_eofc;        /* end-of-file */
        char    tc_brkc;        /* input delimiter (like nl) */
        char    tc_suspc;       /* stop process signal */
        char    tc_dsuspc;      /* delayed stop process signal */
        char    tc_rprntc;      /* reprint line */
        char    tc_flushc;      /* flush output (toggles) */
        char    tc_werasc;      /* word erase */
        char    tc_lnextc;      /* literal next character */
};


2.4.1.4. Terminal hardware support 

The terminal handler allows a user to access basic hardware related functions; e.g. line
speed, modem control, parity, and stop bits. A special signal, SIGHUP, is automatically sent
to processes in a terminal’s process group when a carrier transition is detected. This is
normally associated with a user hanging up on a modem controlled terminal line.

2.4.2. Structured devices 

Structured devices are typified by disks and magnetic tapes, but may represent any
random-access device. The system performs read-modify-write type buffering actions on block 
devices to allow them to be read and written in a totally random access fashion like ordinary 
files. File systems are normally created in block devices. 


2.4.3. Unstructured devices 

Unstructured devices are those devices which do not support block structure. Familiar 
unstructured devices are raw communications lines (with no terminal handler), raster plotters, 
magnetic tape and disks unfettered by buffering and permitting large block input/output and 
positioning and formatting commands. 

2.5. Process and kernel descriptors

The status of the facilities in this section is still under discussion. The ptrace facility of
4.1BSD is provided in 4.2BSD. Planned enhancements would allow a descriptor based process
control facility.

I. Summary of facilities


1. Kernel primitives

1.1. Process naming and protection

sethostid       set UNIX host id
gethostid       get UNIX host id
sethostname     set UNIX host name
gethostname     get UNIX host name
getpid          get process id
fork            create new process
exit            terminate a process
execve          execute a different process
getuid          get user id
geteuid         get effective user id
setreuid        set real and effective user id’s
getgid          get accounting group id
getegid         get effective accounting group id
getgroups       get access group set
setregid        set real and effective group id’s
setgroups       set access group set
getpgrp         get process group
setpgrp         set process group

1.2 Memory management

<mman.h>        memory management definitions
sbrk            change data section size
sstk†           change stack section size
getpagesize     get memory page size
mmap†           map pages of memory
mremap†         remap pages in memory
munmap†         unmap memory
mprotect†       change protection of pages
madvise†        give memory management advice
mincore†        determine core residency of pages

1.3 Signals

<signal.h>      signal definitions
sigvec          set handler for signal
kill            send signal to process
killpgrp        send signal to process group
sigblock        block set of signals
sigsetmask      restore set of blocked signals
sigpause        wait for signals
sigstack        set software stack for signals

1.4 Timing and statistics

<sys/time.h>    time-related definitions
gettimeofday    get current time and timezone
settimeofday    set current time and timezone
getitimer       read an interval timer
setitimer       get and set an interval timer
profil          profile process

† Not supported in 4.2BSD.

1.5 Descriptors

getdtablesize   descriptor reference table size
dup             duplicate descriptor
dup2            duplicate to specified index
close           close descriptor
select          multiplex input/output
fcntl           control descriptor options
wrap†           wrap descriptor with protocol

1.6 Resource controls

<sys/resource.h>  resource-related definitions
getpriority     get process priority
setpriority     set process priority
getrusage       get resource usage
getrlimit       get resource limitations
setrlimit       set resource limitations

1.7 System operation support

mount           mount a device file system
swapon          add a swap device
umount          umount a file system
sync            flush system caches
reboot          reboot a machine
acct            specify accounting file

2. System facilities

2.1 Generic operations

read            read data
write           write data
<sys/uio.h>     scatter-gather related definitions
readv           scattered data input
writev          gathered data output
<sys/ioctl.h>   standard control operations
ioctl           device control operation

2.2 File system

Operations marked with a * exist in two forms: as shown, operating on a file name, and
operating on a file descriptor, when the name is preceded with a “f”.

<sys/file.h>    file system definitions
chdir           change directory
chroot          change root directory
mkdir           make a directory
rmdir           remove a directory
open            open a new or existing file
mknod           make a special file
portal†         make a portal entry
unlink          remove a link
stat*           return status for a file
lstat           returned status of link
chown*          change owner
chmod*          change mode
utimes          change access/modify times
link            make a hard link
symlink         make a symbolic link
readlink        read contents of symbolic link
rename          change name of file
lseek           reposition within file
truncate*       truncate file
access          determine accessibility
flock           lock a file

2.3 Communications

<sys/socket.h>  standard definitions
socket          create socket
bind            bind socket to name
getsockname     get socket name
listen          allow queueing of connections
accept          accept a connection
connect         connect to peer socket
socketpair      create pair of connected sockets
sendto          send data to named socket
send            send data to connected socket
recvfrom        receive data on unconnected socket
recv            receive data on connected socket
sendmsg         send gathered data and/or rights
recvmsg         receive scattered data and/or rights
shutdown        partially close full-duplex connection
getsockopt      get socket option
setsockopt      set socket option

2.5 Terminals, block and character devices

2.4 Processes and kernel hooks

† Not supported in 4.2BSD.






Berkeley VAX/UNIX Assembler Reference Manual 


John F. Reiser 
Bell Laboratories, 

Holmdel, NJ 

and 

Robert R. Henry¹
Electronics Research Laboratory 
University of California 
Berkeley, CA 94720 

November 5, 1979 

Revised 

February 9, 1983 

1. Introduction 

This document describes the usage and input syntax of the UNIX VAX-11 assembler as.
As is designed for assembling the code produced by the “C” compiler; certain concessions 
have been made to handle code written directly by people, but in general little sympathy has 
been extended. This document is intended only for the writer of a compiler or a maintainer 
of the assembler. 

1.1. Assembler Revisions since November 5, 1979 

There has been one major change to as since the last release. As has been updated to 
assemble the new instructions and data formats for “G” and “H” floating point numbers, as 
well as the new queue instructions. 

1.2. Features Supported, but No Longer Encouraged as of February 9, 1983 

These feature(s) in as are supported, but no longer encouraged. 

The colon operator for field initialization is likely to disappear. 

2. Usage 

As is invoked with these command arguments:

as [ -LVWJR ] [ -dn ] [ -DTS ] [ -t directory ] [ -o output ] [ name1 ] ... [ namen ]

The -L flag instructs the assembler to save labels beginning with a “L” in the symbol
table portion of the output file. Labels are not saved by default, as the default action of the
link editor ld is to discard them anyway.

The -V flag tells the assembler to place its interpass temporary file into virtual memory.
In normal circumstances, the system manager will decide where the temporary file should lie.
Our experiments with very large temporary files show that placing the temporary file into
virtual memory will save about 13% of the assembly time, where the size of the temporary file is
about 350K bytes. Most assembler sources will not be this long.

The -W flag turns off all warning error reporting.

¹Preparation of this paper supported in part by the National Science Foundation under grant MCS #78-07291.

The -J flag forces UNIX style pseudo-branch instructions with destinations further away
than a byte displacement to be turned into jump instructions with 4 byte offsets. The -J flag
buys you nothing if -d2 is set. (See §8.4, and future work described in §11.)

The -R flag effectively turns “.data n” directives into “.text n” directives. This
obviates the need to run editor scripts on assembler source to “read-only” fix initialized data
segments. Uninitialized data (via .lcomm and .comm directives) is still assembled into the data
or bss segments.

The -d flag specifies the number of bytes which the assembler should allow for a
displacement when the value of the displacement expression is undefined in the first pass. The
possible values of n are 1, 2, or 4; the assembler uses 4 bytes if -d is not specified. See §8.2.

Provided the -V flag is not set, the -t flag causes the assembler to place its single
temporary file in the directory instead of in /tmp.

The -o flag causes the output to be placed on the file output. By default, the output of 
the assembler is placed in the file a.out in the current directory. 

The input to the assembler is normally taken from the standard input. If file arguments 
occur, then the input is taken sequentially from the files name1, name2, ..., namen. This is 
not to say that the files are assembled separately; name1 is effectively concatenated to name2, 
so multiple definitions cannot occur amongst the input sources. 

The -D (debug), -T (token trace), and the -S (symbol table) flags enable assembler 
trace information, provided that the assembler has been compiled with the debugging code 
enabled. The information printed is long and boring, but useful when debugging the 
assembler. 

3. Lexical conventions 

Assembler tokens include identifiers (alternatively, “symbols” or “names”), constants, 
and operators. 

3.1. Identifiers 

An identifier consists of a sequence of alphanumeric characters (including period “.”, 
underscore “_”, and dollar “$”). The first character may not be numeric. Identifiers may be 
(practically) arbitrarily long; all characters are significant. 

3.2. Constants 

3.2.1. Scalar constants 

All scalar (non floating point) constants are (potentially) 128 bits wide. Such con¬ 
stants are interpreted as two’s complement numbers. Note that 64 bit (quad words) and 
128 bit (octal word) integers are only partially supported by the VAX hardware. In addition, 
128 bit integers are only supported by the extended VAX architecture. As supports 64 and 
128 bit integers only so they can be used as immediate constants or to fill initialized data 
space. As can not perform arithmetic on constants larger than 32 bits. 

Scalar constants are initially evaluated to a full 128 bits, but are pared down by dis¬ 
carding high order copies of the sign bit and categorizing the number as a long, quad or 
octal integer. Numbers with less precision than 32 bits are treated as 32 bit quantities. 
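
The paring-down rule can be pictured with a short sketch. It is not the assembler's own code; the function name categorize_scalar is invented for illustration, and the sketch reads the rule literally (the narrowest two's complement width of 32, 64 or 128 bits that still holds the value).

```python
def categorize_scalar(value):
    """Classify a signed integer as 'long', 'quad' or 'octal'.

    Hypothetical helper: picks the narrowest two's complement width
    (32, 64 or 128 bits) that can hold the value, which is what
    discarding high order copies of the sign bit amounts to.
    """
    for name, bits in (("long", 32), ("quad", 64), ("octal", 128)):
        lowest = -(1 << (bits - 1))
        highest = (1 << (bits - 1)) - 1
        if lowest <= value <= highest:
            return name
    raise ValueError("constant does not fit in 128 bits")
```

Under this literal reading a small number such as 1 is a long, while 2 to the power 31 already needs a quad because bit 31 would otherwise be taken as the sign.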

The digits are “0123456789abcdefABCDEF” with the obvious values. 

An octal constant consists of a sequence of digits with a leading zero. 

A decimal constant consists of a sequence of digits without a leading zero. 






A hexadecimal constant consists of the characters “0x” (or “0X”) followed by a 
sequence of digits. 

A single-character constant consists of a single quote followed by an ASCII charac¬ 
ter, including ASCII newline. The constant’s value is the code for the given character. 
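
As an illustration of the four scalar forms just described, the following sketch converts a token to its value. The function name parse_scalar is an assumption made for this example, and the sketch does no error reporting beyond what Python's int() provides.

```python
def parse_scalar(token):
    """Return the integer value of an octal, decimal, hexadecimal
    or single-character constant (hypothetical helper)."""
    if token.startswith("'") and len(token) == 2:
        return ord(token[1])          # single quote followed by one character
    if token[:2] in ("0x", "0X"):
        return int(token[2:], 16)     # hexadecimal
    if token.startswith("0"):
        return int(token, 8)          # leading zero: octal
    return int(token, 10)             # no leading zero: decimal
```

For instance, parse_scalar("017") and parse_scalar("0xf") both yield 15.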

3.2.2. Floating Point Constants 

Floating point constants are internally represented in the VAX floating point format 
that is specified by the lexical form of the constant. Using the meta notation that [dec] is a 
decimal digit (“0123456789”), [expt] is a type specification character (“fFdDhHgG”), [expe] 
is an exponent delimiter and type specification character (“eEfFdDhHgG”), x* means 0 or 
more occurrences of x, and x+ means 1 or more occurrences of x, the general lexical form of a 
floating point number is: 

0[expe]([+-])[dec]+(.)([dec]*)([expt]([+-])[dec]+) 

The standard semantic interpretation is used for the signed integer, fraction and signed 
power of 10 exponent. If the exponent delimiter is specified, it must be either an “e” or 
“E”, or must agree with the initial type specification character that is used. The type 
specification character specifies the type and representation of the constructed number, as 
follows: 

type character    floating representation    size (bits) 
f, F              F format floating          32 
d, D              D format floating          64 
g, G              G format floating          64 
h, H              H format floating          128 


Note that “G” and “H” format floating point numbers are not supported by all implementa¬ 
tions of the VAX architecture. As does not require the augmented architecture in order to 
run. 

The assembler uses the library routine atof() to convert “F” and “D” numbers, and 
uses its own conversion routine (derived from atof , and believed to be numerically accurate) 
to convert “G” and “H” floating point numbers. 

Collectively, all floating point numbers, together with quad and octal scalars are called 
Bignums. When as requires a Bignum, a 32 bit scalar quantity may also be used. 

3.2.3. String Constants 

A string constant is defined using the same syntax and semantics as the “C” language 
uses. Strings begin and end with a double quote character. The DEC MACRO-32 assembler 
conventions for flexible string quoting are not implemented. All “C” backslash conventions are 
observed; the backslash conventions peculiar to the PDP-11 assembler are not observed. 
Strings are known by their value and their length; the assembler does not implicitly end 
strings with a null byte. 

3.3. Operators 

There are several single-character operators; see §6.1. 

3.4. Blanks 

Blank and tab characters may be interspersed freely between tokens, but may not be 
used within tokens (except character constants). A blank or tab is required to separate adja¬ 
cent identifiers or constants not otherwise separated. 





3.5. Scratch Mark Comments 

The character “#” introduces a comment, which extends through the end of the line on 
which it appears. Comments starting in column 1, having the format “# expression string”, 
are interpreted as an indication that the assembler is now assembling file string at line 
expression. Thus, one can use the “C” preprocessor on an assembly language source file, and 
use the #include and #define preprocessor directives. (Note that there may not be an 
assembler comment starting in column 1 if the assembler source is given to the “C” preprocessor, 
as it will be interpreted by the preprocessor in a way not intended.) Comments are otherwise 
ignored by the assembler. 

3.6. “C” Style Comments 

The assembler will recognize “C” style comments, introduced with the prologue /* and 
ending with the epilogue */. “C” style comments may extend across multiple lines, and are 
the preferred comment style to use if one chooses to use the “C” preprocessor. 

4. Segments and Location Counters 

Assembled code and data fall into three segments: the text segment, the data segment, 
and the bss segment. The UNIX operating system makes some assumptions about the content 
of these segments; the assembler does not. Within the text and data segments there are a 
number of sub-segments, distinguished by number (“text 0”, “text 1”, ..., “data 0”, “data 
1”, ...). Currently there are four subsegments each in text and data. The subsegments are 
for programming convenience only. 

Before writing the output file, the assembler zero-pads each text subsegment to a multi¬ 
ple of four bytes and then concatenates the subsegments in order to form the text segment; an 
analogous operation is done for the data segment. Requesting that the loader define symbols 
and storage regions is the only action allowed by the assembler with respect to the bss seg¬ 
ment. Assembly begins in “text 0”. 

Associated with each (sub)segment is an implicit location counter which begins at zero 
and is incremented by 1 for each byte assembled into the (sub)segment. There is no way to 
explicitly reference a location counter. Note that the location counters of subsegments other 
than “text 0” and “data 0” behave peculiarly due to the concatenation used to form the text 
and data segments. 
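
The padding and concatenation step can be sketched as follows. This is only an illustration; the function name and the representation of a subsegment as a Python bytes object are assumptions. The returned base offsets show why a location counter in any subsegment other than number 0 does not equal the final offset within the segment.

```python
def concatenate_subsegments(subsegments):
    """Zero-pad each subsegment to a multiple of four bytes and join them.

    subsegments: list of bytes objects, indexed by subsegment number.
    Returns (segment, bases), where bases[i] is the offset at which
    subsegment i starts in the finished segment.
    """
    segment = bytearray()
    bases = []
    for body in subsegments:
        bases.append(len(segment))
        segment += body
        segment += bytes(-len(body) % 4)   # pad to a four byte boundary
    return bytes(segment), bases
```

A symbol at location counter 0 in “text 1” therefore ends up at bases[1] in the text segment, not at 0.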

5. Statements 

A source program is composed of a sequence of statements. Statements are separated 
either by new-lines or by semicolons. There are two kinds of statements: null statements and 
keyword statements. Either kind of statement may be preceded by one or more labels. 

5.1. Named Global Labels 

A global label consists of a name followed by a colon. The effect of a name label is to 
assign the current value and type of the location counter to the name. An error is indicated 
in pass 1 if the name is already defined; an error is indicated in pass 2 if the value assigned 
changes the definition of the label. 

A global label is referenced by its name. 

Global labels beginning with a “L” are discarded unless the -L option is in effect. 

5.2. Numeric Local Labels 

A numeric label consists of a digit 0 to 9 followed by a colon. Such a label serves to 
define temporary symbols of the form “nb” and “nf”, where n is the digit of the label. As in 
the case of name labels, a numeric label assigns the current value and type of the location 
counter to the temporary symbol. However, several numeric labels with the same digit may 
be used within the same assembly. References to symbols of the form “nb” refer to the first 
numeric label “n:” backwards from the reference; “nf” symbols refer to the first numeric 
label “n:” forwards from the reference. Such numeric labels conserve the inventive powers of 
the human programmer. 

For various reasons, as turns local labels into labels of the form Ln.$m. Although 
unlikely, these generated labels may conflict with programmer defined labels. 
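
The lookup rule for “nb” and “nf” can be sketched as below. The names are invented for the example, positions are taken to be statement indices, and a label attached to the referencing statement itself is assumed to count as “backwards”; the assembler's real bookkeeping is not described here.

```python
def resolve_local(labels, ref_pos, symbol):
    """Find the numeric label a reference such as '1b' or '1f' denotes.

    labels:  list of (position, digit) pairs for every numeric label.
    ref_pos: position of the statement containing the reference.
    Returns the position of the chosen label, or None if there is none.
    """
    digit, direction = int(symbol[0]), symbol[1]
    if direction == "b":
        found = [p for p, d in labels if d == digit and p <= ref_pos]
        return max(found) if found else None   # nearest label going backwards
    found = [p for p, d in labels if d == digit and p > ref_pos]
    return min(found) if found else None       # nearest label going forwards
```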

5.3. Null statements 

A null statement is an empty statement ignored by the assembler. A null statement 
may be labeled, however. 

5.4. Keyword statements 

A keyword statement begins with one of the many predefined keywords known to as; 
the syntax of the remainder of the statement depends on the keyword. All instruction 
opcodes are keywords. The remaining keywords are assembler pseudo-operations, also called 
directives. The pseudo-operations are listed in §7, together with the syntax they require. 

6. Expressions 

An expression is a sequence of symbols representing a value. Its constituents are 
identifiers, constants, operators, and parentheses. Each expression has a type. 

All operators in expressions are fundamentally binary in nature. Arithmetic is two’s 
complement and has 32 bits of precision. As can not do arithmetic on floating point numbers, 
quad or octal precision scalar numbers. There are four levels of precedence, listed here from 
lowest precedence level to highest: 

precedence    operators 
binary        +, - 
binary        |, &, ^, ! 
binary        *, /, % 
unary         -, ~ 

All operators of the same precedence are evaluated strictly left to right, except for the 
evaluation order enforced by parentheses. 

6.1. Expression Operators 

The operators are: 
operator      meaning 
+             addition 
- (binary)    subtraction 
*             multiplication 
/             division 
%             modulo 
- (unary)     2’s complement 
&             bitwise and 
|             bitwise or 
^             bitwise exclusive or 
!             bitwise or not 
~             bitwise 1’s complement 
>             logical right shift 
>>            logical right shift 
<             logical left shift 
<<            logical left shift 

Expressions may be grouped by use of parentheses, “(” and “)”. 
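
A few of these operators are sketched below with 32 bit two's complement wraparound. The sketch is illustrative only: the names are invented, division and modulo are left out because their rounding behavior is not stated here, and “bitwise or not” is taken to mean a | ~b.

```python
MASK = 0xFFFFFFFF

def to_signed(v):
    """Interpret the low 32 bits of v as a two's complement number."""
    v &= MASK
    return v - (1 << 32) if v & 0x80000000 else v

BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    "^": lambda a, b: a ^ b,
    "!": lambda a, b: a | ~b,             # bitwise or not
    ">": lambda a, b: (a & MASK) >> b,    # logical (zero filling) right shift
    "<": lambda a, b: a << b,             # logical left shift
}

def binary_op(op, a, b):
    """Apply one binary operator and wrap the result to 32 bits."""
    return to_signed(BINARY_OPS[op](a, b))
```

For example, binary_op("+", 0x7FFFFFFF, 1) wraps to the most negative 32 bit value, and binary_op(">", -1, 28) yields 15 because the shift is logical rather than arithmetic.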

6.2. Data Types 

The assembler manipulates several different types of expressions. The types likely to 
be met explicitly are: 

undefined 
    Upon first encounter, each symbol is undefined. It may become undefined if it is 
    assigned an undefined expression. It is an error to attempt to assemble an 
    undefined expression in pass 2; in pass 1, it is not (except that certain keywords 
    require operands which are not undefined). 

undefined external 
    A symbol which is declared .globl but not defined in the current assembly is an 
    undefined external. If such a symbol is declared, the link editor ld must be used 
    to load the assembler’s output with another routine that defines the undefined 
    reference. 

absolute 
    An absolute symbol is defined ultimately from a constant. Its value is unaffected 
    by any possible future applications of the link-editor to the output file. 

text 
    The value of a text symbol is measured with respect to the beginning of the text 
    segment of the program. If the assembler output is link-edited, its text symbols 
    may change in value since the program need not be the first in the link editor’s 
    output. Most text symbols are defined by appearing as labels. At the start of an 
    assembly, the value of “.” is “text 0”. 

data 
    The value of a data symbol is measured with respect to the origin of the data 
    segment of a program. Like text symbols, the value of a data symbol may change 
    during a subsequent link-editor run since previously loaded programs may have 
    data segments. After the first .data statement, the value of “.” is “data 0”. 

bss 
    The value of a bss symbol is measured from the beginning of the bss segment of a 
    program. Like text and data symbols, the value of a bss symbol may change 
    during a subsequent link-editor run, since previously loaded programs may have bss 
    segments. 

external absolute, text, data, or bss 
    Symbols declared .globl but defined within an assembly as absolute, text, data, or 
    bss symbols may be used exactly as if they were not declared .globl; however, 
    their value and type are available to the link editor so that the program may be 
    loaded with others that reference these symbols. 

register 
    The symbols 

    r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 r14 r15 ap fp sp pc 

    are predefined as register symbols. In addition, the “%” operator converts the 
    following absolute expression whose value is between 0 and 15 into a register 
    reference. 

other types 
    Each keyword known to the assembler has a type which is used to select the 
    routine which processes the associated keyword statement. The behavior of such 
    symbols when not used as keywords is the same as if they were absolute. 

6.3. Type Propagation in Expressions 

When operands are combined by expression operators, the result has a type which 
depends on the types of the operands and on the operator. The rules involved are complex 
to state but were intended to be sensible and predictable. For purposes of expression evalua¬ 
tion the important types are 

undefined 

absolute 

text 

data 

bss 

undefined external 
other 

The combination rules are then 

(1) If one of the operands is undefined, the result is undefined. 

(2) If both operands are absolute, the result is absolute. 

(3) If an absolute is combined with one of the “other types” mentioned above, the result 
has the other type. When an “other type” is combined with an explicitly discussed type 
other than absolute, it acts like an absolute. 

Further rules applying to particular operators are: 

+ If one operand is text-, data-, or bss-segment relocatable, or is an undefined external, 
the result has the postulated type and the other operand must be absolute. 

- If the first operand is a relocatable text-, data-, or bss-segment symbol, the second 
operand may be absolute (in which case the result has the type of the first operand); or 
the second operand may have the same type as the first (in which case the result is 
absolute). If the first operand is external undefined, the second must be absolute. All 
other combinations are illegal. 

others 

It is illegal to apply these operators to any but absolute symbols. 
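
The rules for “+”, “-” and the remaining operators can be condensed into a small sketch. It is not taken from the assembler; the function name and the use of strings for types are assumptions, and rule (3) about the “other types” is deliberately left out.

```python
RELOCATABLE = {"text", "data", "bss"}

def combine(op, left, right):
    """Return the type of `left op right`, or raise ValueError if illegal."""
    if "undefined" in (left, right):
        return "undefined"                       # rule (1)
    if left == "absolute" and right == "absolute":
        return "absolute"                        # rule (2)
    if op == "+":
        # one relocatable or undefined external operand plus an absolute
        for a, b in ((left, right), (right, left)):
            if a in RELOCATABLE | {"undefined external"} and b == "absolute":
                return a
    elif op == "-":
        if left in RELOCATABLE and right == "absolute":
            return left
        if left in RELOCATABLE and right == left:
            return "absolute"                    # difference within one segment
        if left == "undefined external" and right == "absolute":
            return left
    raise ValueError("illegal operand types for " + op)
```

The useful case to remember is that the difference of two symbols in the same segment is absolute, so it may be used wherever a constant is required.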

7. Pseudo-operations (Directives) 

The keywords listed below introduce directives or instructions, and influence the later 
behavior of the assembler for this statement. The metanotation 

[ stuff ] 

means that 0 or more instances of the given “stuff” may appear. 

Boldface tokens must appear literally; words in italics are substitutable. 

The pseudo-operations listed below are grouped into functional categories. 

7.1. Interface to a Previous Pass 
.ABORT 

As soon as the assembler sees this directive, it ignores all further input (but it does 
read to the end of file), and aborts the assembly. No files are created. It is anticipated that 
this would be used in a pipe interconnected version of a compiler, where the first major syn¬ 
tax error would cause the compiler to issue this directive, saving unnecessary work in assem¬ 
bling code that would have to be discarded anyway. 

.file string 

This directive causes the assembler to think it is in file string, so error messages reflect 
the proper source file. 

.line expression 

This directive causes the assembler to think it is on line expression so error messages 
reflect the proper source file. 

The only effect of assembling multiple files specified in the command string is to insert 
the file and line directives, with the appropriate values, at the beginning of the source from 
each file. 

# expression string 

# expression 

This is the only instance where a comment is meaningful to the assembler. The “#” 
must be in the first column. This meta comment causes the assembler to believe it is on line 
expression. The second argument, if included, causes the assembler to believe it is in file 
string, otherwise the current file name does not change. 
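
A reader of such meta comments might look like the sketch below. It is an assumption-laden illustration: the names are invented, the expression is restricted to a plain decimal number, and the file name is expected in double quotes as the “C” preprocessor emits it.

```python
import re

# "#" in column 1, a decimal line number, then an optional quoted file name
LINE_MARKER = re.compile(r'^#\s*(\d+)(?:\s+"([^"]*)")?\s*$')

def parse_line_marker(line, current_file):
    """Return (line_number, file_name), or None for an ordinary comment."""
    m = LINE_MARKER.match(line)
    if m is None:
        return None
    file_name = m.group(2) if m.group(2) is not None else current_file
    return int(m.group(1)), file_name
```

When the string is omitted, the current file name is carried over unchanged, as the text above requires.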

7.2. Location Counter Control 

.data [ expression ] 

.text [ expression ] 

These two pseudo-operations cause the assembler to begin assembling into the indi¬ 
cated text or data subsegment. If specified, the expression must be defined and absolute; an 
omitted expression is treated as zero. The effect of a .data directive is treated as a .text 
directive if the -R assembly flag is set. Assembly starts in the .text 0 subsegment. 

The directives .align and .org also control the placement of the location counter. 

7.3. Filled Data 

.align align_expr [ , fill_expr ] 

The location counter is adjusted so that the align_expr lowest bits of the location 
counter become zero. This is done by assembling from 0 to 2^align_expr bytes, taken from the 
low order byte of fill_expr. If present, fill_expr must be absolute; otherwise it defaults to 0. 
Thus “.align 2” pads by null bytes to make the location counter evenly divisible by 4. The 
align_expr must be defined, absolute, nonnegative, and less than 16. 

Warning: the subsegment concatenation convention and the current loader conventions 
may not preserve attempts at aligning to more than 2 low-order zero bits. 
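
The amount of padding produced by .align follows directly from the location counter. The sketch below shows the arithmetic; the function name is invented for the example.

```python
def align_padding(location, align_expr, fill_expr=0):
    """Return the fill bytes needed to clear the low align_expr bits."""
    if not 0 <= align_expr < 16:
        raise ValueError("align_expr must be nonnegative and less than 16")
    boundary = 1 << align_expr
    count = -location % boundary          # 0 when already aligned
    return bytes([fill_expr & 0xFF]) * count
```

With the location counter at 5, “.align 2” yields three null bytes and leaves the counter at 8.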

.org org_expr [ , fill_expr ] 

The location counter is set equal to the value of org_expr, which must be defined and 
absolute. The value of the org_expr must be greater than the current value of the location 
counter. Space between the current value of the location counter and the desired value is 
filled with bytes taken from the low order byte of fill_expr, which must be absolute and 
defaults to 0. 

.space space_expr [ , fill_expr ] 

The location counter is advanced by space_expr bytes. Space_expr must be defined 
and absolute. The space is filled in with bytes taken from the low order byte of fill_expr, 
which must be defined and absolute. Fill_expr defaults to 0. The .fill directive is a more 
general way to accomplish the .space directive. 

.fill rep_expr , size_expr , fill_expr 

All three expressions must be absolute. Fill_expr, treated as an expression of size 
size_expr bytes, is assembled and replicated rep_expr times. The effect is to advance the 
current location counter rep_expr * size_expr bytes. Size_expr must be between 1 and 8. 
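
The bytes generated by .fill can be sketched as follows. The function names are invented; the value is stored low order byte first, as VAX data is, and a value wider than the unit is simply truncated, which is an assumption about a case the text does not spell out.

```python
def fill(rep_expr, size_expr, fill_expr):
    """Return the bytes produced by `.fill rep_expr, size_expr, fill_expr`."""
    if not 1 <= size_expr <= 8:
        raise ValueError("size_expr must be between 1 and 8")
    mask = (1 << (8 * size_expr)) - 1
    unit = (fill_expr & mask).to_bytes(size_expr, "little")
    return unit * rep_expr

def space(space_expr, fill_expr=0):
    """`.space` expressed through the more general `.fill`."""
    return fill(space_expr, 1, fill_expr)
```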

7.4. Symbol Definitions 

7.5. Initialized Data 


.byte expr [ , expr ] 

.word expr [ , expr ] 

.int expr [ , expr ] 

.long expr [ , expr ] 


The expressions in the comma-separated list are truncated to the size indicated by the 
key word: 

keyword    length (bits) 
.byte      8 
.word      16 
.int       32 
.long      32 


and assembled in successive locations. The expressions must be absolute. 

Each expression may optionally be of the form: 

expression1 : expression2 

In this case, the value of expression2 is truncated to expression1 bits, and assembled in the 
next expression1 bit field which fits in the natural data size being assembled. Bits which are 
skipped because a field does not fit are filled with zeros. Thus, “.byte 123” is equivalent to 
“.byte 8:123”, and “.byte 3:1,2:1,5:1” assembles two bytes, containing the values 9 and 1. 

NB: Bit field initialization with the colon operator is likely to disappear in future 
releases of the assembler. 
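
The packing order implied by the example (fields fill each unit from the low order bit upward) can be checked with a short sketch. The function name and the list-of-pairs interface are assumptions made for illustration.

```python
def pack_bit_fields(fields, unit_bits=8):
    """Pack (width, value) pairs into units of unit_bits bits.

    Fields are placed from the low order bit upward. A field that does
    not fit in what is left of the current unit starts a new unit, and
    the skipped bits stay zero.
    """
    units, current, used = [], 0, 0
    for width, value in fields:
        if not 0 < width <= unit_bits:
            raise ValueError("field width does not fit the data size")
        if used + width > unit_bits:
            units.append(current)
            current, used = 0, 0
        current |= (value & ((1 << width) - 1)) << used
        used += width
    if used:
        units.append(current)
    return units
```

Calling pack_bit_fields([(3, 1), (2, 1), (5, 1)]) reproduces the two bytes 9 and 1 from the example: the first two fields share a byte, and the five bit field no longer fits in the three bits that remain.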
.quad number [ , number ] 

.octa number [ , number ] 

.float number [ , number ] 

.double number [ , number ] 

.ffloat number [ , number ] 

.dfloat number [ , number ] 

.gfloat number [ , number ] 

.hfloat number [ , number ] 

These initialize Bignums (see §3.2.2) in successive locations whose size is a function of 
the key word. The type of the Bignums (determined by the exponent field, or lack thereof) 
may not agree with the type implied by the key word. The following table shows the key words, 
their size, and the data types for the Bignums they expect. 


keyword    format          length (bits)    valid number(s) 
.quad      quad scalar     64               scalar 
.octa      octal scalar    128              scalar 
.float     F float         32               F, D and scalar 
.ffloat    F float         32               F, D and scalar 
.double    D float         64               F, D and scalar 
.dfloat    D float         64               F, D and scalar 
.gfloat    G float         64               G scalar 
.hfloat    H float         128              H scalar 


As will correctly perform other floating point conversions while initializing, but issues a 
warning message. As performs all floating point initializations and conversions using only 
the facilities defined in the original (native) architecture. 

.ascii string [ , string] 

.asciz string [ , string] 

Each string in the list is assembled into successive locations, with the first letter in the 
string being placed into the first location, etc. The .ascii directive will not null pad the 
string; the .asciz directive will null pad the string. (Recall that strings are known by their 
length, and need not be terminated with a null, and that the “C” conventions for escaping 
are understood.) The .ascii directive is identical to: 

.byte string0 , string1 , ... 

.comm name, expression 

Provided the name is not defined elsewhere, its type is made “undefined external”, and 
its value is expression. In fact the name behaves in the current assembly just like an 
undefined external. However, the link editor ld has been special-cased so that all external 
symbols which are not otherwise defined, and which have a non-zero value, are defined to lie 
in the bss segment, and enough space is left after the symbol to hold expression bytes. 

.lcomm name, expression 

expression bytes will be allocated in the bss segment and name assigned the location of 
the first byte, but the name is not declared as global and hence will be unknown to the link 
editor. 

.globl name 
This statement makes the name external. If it is otherwise defined (by .set or by 
appearance as a label) it acts within the assembly exactly as if the .globl statement were not 
given; however, the link editor may be used to combine this object module with other 
modules referring to this symbol. 

Conversely, if the given symbol is not defined within the current assembly, the link edi¬ 
tor can combine the output of this assembly with that of others which define the symbol. 
The assembler makes all otherwise undefined symbols external. 

.set name , expression 

The (name, expression) pair is entered into the symbol table. Multiple .set statements 
with the same name are legal; the most recent value replaces all previous values. 

.lsym name , expression 

A unique and otherwise unreferencable instance of the (name, expression) pair is 
created in the symbol table. The Fortran 77 compiler uses this mechanism to pass local 
symbol definitions to the link editor and debugger. 

.stabs string, expr1, expr2, expr3, expr4 

.stabn expr1, expr2, expr3, expr4 

.stabd expr1, expr2, expr3 

The stab directives place symbols in the symbol table for the symbolic debugger, sdb². 
A “stab” is a symbol table entry. The .stabs is a string stab, the .stabn is a stab not having 
a string, and the .stabd is a “dot” stab that implicitly references “dot”, the current location 
counter. 

The string in the .stabs directive is the name of a symbol. If the symbol name is zero, 
the .stabn directive may be used instead. 

The other expressions are stored in the name list structure of the symbol table and 
preserved by the loader for reference by sdb; the value of the expressions are peculiar to for¬ 
mats required by sdb. 

expr1   is used as a symbol table tag (nlist field n_type). 

expr2   seems to always be zero (nlist field n_other). 

expr3   is used for either the source line number, or for a nesting level (nlist field n_desc). 

expr4   is used as tag specific information (nlist field n_value). In the case of the .stabd 
        directive, this expression is nonexistent, and is taken to be the value of the location 
        counter at the following instruction. Since there is no associated name for a .stabd 
        directive, it can only be used in circumstances where the name is zero. The effect of a 
        .stabd directive can be achieved by one of the other .stabx directives in the following 
        manner: 

        .stabn expr1, expr2, expr3, LLn 
        LLn: 

The .stabd directive is preferred, because it does not clog the symbol table with labels 
used only for the stab symbol entries. 


² Katseff, H.P. Sdb: A Symbol Debugger. Bell Laboratories, Holmdel, NJ. April 12, 1979. 

Katseff, H.P. Symbol Table Format for Sdb, File 39394, Bell Laboratories, Holmdel, NJ. March 14, 1979. 

8. Machine instructions 

The syntax of machine instruction statements accepted by as is generally similar to the 
syntax of DEC MACRO-32. There are differences, however. 

8.1. Character set 

As uses the character “$” instead of “#” for immediate constants, and the character 
“*” instead of “@” for indirection. Opcodes and register names are spelled with lower-case 
rather than upper-case letters. 

8.2. Specifying Displacement Lengths 

Under certain circumstances, the following constructs are (optionally) recognized by as 
to indicate the number of bytes to allocate for the displacement used when constructing dis¬ 
placement and displacement deferred addressing modes: 


primary    alternate    length 
B`         B^           byte (1 byte) 
W`         W^           word (2 bytes) 
L`         L^           long word (4 bytes) 

One can also use lower case b, w or l instead of the upper case letters. There must be 
no space between the size specifier letter and the “^” or “`”. The constructs S^ and G^ are 
not recognized by as, as they are by the DEC MACRO-32 assembler. It is preferred to use the 
“`” displacement specifier, so that the “^” is not misinterpreted as the xor operator. 

Literal values (including floating-point literals used where the hardware expects a 
floating-point operand) are assembled as short literals if possible, hence not needing the S^ 
DEC MACRO-32 directive. 

If the displacement length modifier is present, then the displacement is always 
assembled with that length, even if it will fit into a smaller field, or if significance is lost. If 
the length modifier is not present, and if the value of the displacement is known exactly in 
as’s first pass, then as determines the length automatically, assembling it in the shortest 
possible way. Otherwise, as will use the value specified by the -d argument, which defaults to 4 
bytes. 
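
The three cases above can be summarized in a short sketch. The function and parameter names are assumptions, and “shortest possible” is taken to mean the smallest signed field of 1, 2 or 4 bytes that holds the value.

```python
def displacement_length(value=None, modifier=None, default_d=4):
    """Return the number of bytes allotted to a displacement.

    value:     the displacement if known in the first pass, else None.
    modifier:  'B', 'W' or 'L' (either case) if a length was given.
    default_d: the value of the -d argument (1, 2 or 4).
    """
    if modifier is not None:
        return {"b": 1, "w": 2, "l": 4}[modifier.lower()]   # always honored
    if value is None:
        return default_d                                    # unknown in pass 1
    if -128 <= value <= 127:
        return 1
    if -32768 <= value <= 32767:
        return 2
    return 4
```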

8.3. casex Instructions 

As considers the instructions caseb, casel, casew to have three operands. The 
displacements must be explicitly computed by as, using one or more .word statements. 

8.4. Extended branch instructions 

These opcodes (formed in general by substituting a “j” for the initial “b” of the stan¬ 
dard opcodes) take as branch destinations the name of a label in the current subsegment. It 
is an error if the destination is known to be in a different subsegment, and it is a warning if 
the destination is not defined within the object module being assembled. 

If the branch destination is close enough, then the corresponding short branch “b” 
instruction is assembled. Otherwise the assembler chooses a sequence of one or more 
instructions which together have the same effect as if the “b” instruction had a larger span. In 
general, as chooses the inverse branch followed by a brw, but a brw is sometimes pooled among 
several “j” instructions with the same destination. 

As is unable to perform the same long/short branch generation for other instructions 
with a fixed byte displacement, such as the sob, aob families, or for the acbx family of 
instructions which has a fixed word displacement. This would be desirable, but is prohibitive 
because of the complexity of these instructions. 

If the -J assembler option is given, a jmp instruction is used instead of a brw instruc¬ 
tion for ALL “j” instructions with distant destinations. This makes assembly of large 
(>32K bytes) programs (inefficiently) possible. As does not try to use clever combinations of 
brb, brw and jmp instructions. The jmp instructions use PC relative addressing, with the 
length of the offset given by the -d assembler option. 
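
The general strategy (short branch when the target is near, otherwise the inverse branch around a brw or jmp) is sketched below. This is a simplification, not the assembler's algorithm: the names are invented, only a handful of condition pairs are listed, the displacement is assumed to be already known, and the pooling of a shared brw is ignored.

```python
# A few VAX conditional branches and their inverses
INVERSE = {
    "beql": "bneq", "bneq": "beql",
    "bgeq": "blss", "blss": "bgeq",
    "bgtr": "bleq", "bleq": "bgtr",
    "bcc": "bcs", "bcs": "bcc",
    "bvc": "bvs", "bvs": "bvc",
}

def expand_jump(jop, displacement, use_jmp=False):
    """Return the opcodes assembled for one extended branch.

    jop:          the "j" opcode, e.g. "jeql" or "jbr".
    displacement: distance to the destination in bytes.
    use_jmp:      True when the -J option is in effect.
    """
    near = -128 <= displacement <= 127
    far_op = "jmp" if use_jmp else "brw"
    if jop == "jbr":
        return ["brb"] if near else [far_op]
    bop = "b" + jop[1:]
    if near:
        return [bop]
    # inverse branch skips over the long unconditional branch
    return [INVERSE[bop], far_op]
```

For example, a distant jeql becomes bneq followed by brw, or bneq followed by jmp when -J is given.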

These are the extended branch instructions as recognizes: 


jeql     jeqlu    jneq 
jgeq     jgequ    jgtr 
jleq     jlequ    jlss 
jbcc     jbsc     jbcs 
jlbc     jlbs 
jcc      jcs 
jvc      jvs 
jbc      jbs 
jbr 

Note that jbr turns into brb if its target is close enough; otherwise a brw is used. 

9. Diagnostics 

Diagnostics are intended to be self explanatory and appear on the standard output. 
Diagnostics either report an error or a warning. Error diagnostics complain about lexical, 
syntactic and some semantic errors, and abort the assembly. 

The majority of the warnings complain about the use of VAX features not supported by 
all implementations of the architecture. As will warn if new opcodes are used, if “G” or “H” 
floating point numbers are used and will complain about mixed floating conversions. 

10. Limits 


limit         what 
Arbitrary³    Files to assemble 
BUFSIZ        Significant characters per name 
Arbitrary     Characters per input line 
Arbitrary     Characters per string 
Arbitrary     Symbols 
4             Text segments 
4             Data segments 


11. Annoyances and Future Work 

Most of the annoyances deal with restrictions on the extended branch instructions. 

As only uses a two level algorithm for resolving extended branch instructions into short 
or long displacements. What is really needed is a general mechanism to turn a short condi¬ 
tional jump into a reverse conditional jump over one of two possible unconditional branches, 
either a brw or a jmp instruction. Currently, the -J forces the jmp instruction to always be 
used, instead of the shorter brw instruction when needed. 


³ Although the number of characters available to the argv line is restricted by UNIX to 10240. 
The assembler should also recognize extended branch instructions for sob, aob, and 
acbx instructions. Sob instructions will be easy, aob will be harder because the synthesized 
instruction uses the index operand twice, so one must be careful of side effects, and the acbx 
family will be much harder (in the general case) because the comparison depends on the sign 
of the addend operand, and two operands are used more than once. Augmenting as with 
these extended loop instructions will allow the peephole optimizer to produce much better 
loop optimizations, since it currently assumes the worst case about the size of the loop body. 

The string temporary file is not put in memory when the -V flag is set. The string table 
in the generated a.out contains some strings and names that are never referenced from the 
symbol table; the loader removes these unreferenced strings, however. 
The UNIX I/O System 

Dennis M. Ritchie 

Bell Laboratories 
Murray Hill, New Jersey 07974 


This paper gives an overview of the workings of the UNIX I/O system. It was written 
with an eye toward providing guidance to writers of device driver routines, and is oriented 
more toward describing the environment and nature of device drivers than the implementation 
of that part of the file system which deals with ordinary files. 

It is assumed that the reader has a good knowledge of the overall structure of the file sys¬ 
tem as discussed in the paper “The UNIX Time-sharing System.” A more detailed discussion 
appears in “UNIX Implementation;” the current document restates parts of that one, but is 
still more detailed. It is most useful in conjunction with a copy of the system code, since it is 
basically an exegesis of that code. 

Device Classes 

There are two classes of device: block and character. The block interface is suitable for 
devices like disks, tapes, and DECtape which work, or can work, with addressable 512-byte
blocks. Ordinary magnetic tape just barely fits in this category, since by use of forward and 
backward spacing any block can be read, even though blocks can be written only at the end of 
the tape. Block devices can at least potentially contain a mounted file system. The interface 
to block devices is very highly structured; the drivers for these devices share a great many rou¬ 
tines as well as a pool of buffers. 

Character-type devices have a much more straightforward interface, although more work 
must be done by the driver itself. 

Devices of both types are named by a major and a minor device number. These numbers 
are generally stored as an integer with the minor device number in the low-order 8 bits and the 
major device number in the next-higher 8 bits; macros major and minor are available to access 
these numbers. The major device number selects which driver will deal with the device; the 
minor device number is not used by the rest of the system but is passed to the driver at 
appropriate times. Typically the minor number selects a subdevice attached to a given con¬ 
troller, or one of several similar hardware interfaces. 
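
The packing just described can be sketched in a few lines of C. The macro names below are hypothetical stand-ins, not the system's own major and minor macros, and the sketch assumes exactly the layout given above (minor number in the low-order 8 bits, major number in the next-higher 8 bits).

```c
/* Hypothetical stand-ins for the kernel's major and minor macros,
 * assuming the layout described in the text: minor number in the
 * low-order 8 bits, major number in the next-higher 8 bits. */
#define TOY_MINOR(dev)        ((dev) & 0377)
#define TOY_MAJOR(dev)        (((dev) >> 8) & 0377)

/* Assumed helper (not named in the text) that packs the two numbers. */
#define TOY_MAKEDEV(maj, min) ((((maj) & 0377) << 8) | ((min) & 0377))
```

With these definitions, a device with major number 3 and minor number 5 is stored as the integer 0x0305.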

The major device numbers for block and character devices are used as indices in separate 
tables; they both start at 0 and therefore overlap. 

Overview of I/O 

The purpose of the open and creat system calls is to set up entries in three separate system
tables. The first of these is the u_ofile table, which is stored in the system’s per-process
data area u. This table is indexed by the file descriptor returned by the open or creat, and is
accessed during a read, write, or other operation on the open file. An entry contains only a
pointer to the corresponding entry of the file table, which is a per-system data base. There is 
one entry in the file table for each instance of open or creat. This table is per-system because 
the same instance of an open file must be shared among the several processes which can result 
from forks after the file is opened. A file table entry contains flags which indicate whether the 
file was open for reading or writing or is a pipe, and a count which is used to decide when all
processes using the entry have terminated or closed the file (so the entry can be abandoned).
There is also a 32-bit file offset which is used to indicate where in the file the next read or
write will take place. Finally, there is a pointer to the entry for the file in the inode table,
which contains a copy of the file’s i-node.

†UNIX is a Trademark of Bell Laboratories.

Certain open files can be designated “multiplexed” files, and several other flags apply to 
such channels. In such a case, instead of an offset, there is a pointer to an associated multi¬ 
plex channel table. Multiplex channels will not be discussed here. 

An entry in the file table corresponds precisely to an instance of open or creat; if the 
same file is opened several times, it will have several entries in this table. However, there is at 
most one entry in the inode table for a given file. Also, a file may enter the inode table not 
only because it is open, but also because it is the current directory of some process or because 
it is a special file containing a currently-mounted file system. 
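
The relation between the two tables can be illustrated with a toy model. Everything below is an assumption made for the sketch: the table sizes, the toy_ names, and the reduced set of fields are not taken from the system code. It shows only that each open yields a fresh file table entry, while a given file has at most one inode table entry.

```c
#include <stddef.h>

#define NFILE  8    /* toy sizes, not the kernel's */
#define NINODE 8

struct toy_inode {              /* at most one per distinct file in use */
    int i_dev, i_number;        /* device and i-number whence the entry came */
    int i_count;                /* references; the slot is free when 0 */
};

struct toy_file {               /* one per instance of open or creat */
    int  f_count;               /* users of this entry; free when 0 */
    long f_offset;              /* where the next read or write takes place */
    struct toy_inode *f_inode;  /* pointer into the inode table */
};

static struct toy_inode inode_tab[NINODE];
static struct toy_file  file_tab[NFILE];

/* Find the in-core entry for (dev, inum), or claim a free slot. */
static struct toy_inode *toy_iget(int dev, int inum)
{
    struct toy_inode *ip, *freeip = NULL;

    for (ip = inode_tab; ip < &inode_tab[NINODE]; ip++) {
        if (ip->i_count > 0 && ip->i_dev == dev && ip->i_number == inum) {
            ip->i_count++;      /* already in core: share the entry */
            return ip;
        }
        if (ip->i_count == 0 && freeip == NULL)
            freeip = ip;
    }
    if (freeip != NULL) {
        freeip->i_dev = dev;
        freeip->i_number = inum;
        freeip->i_count = 1;
    }
    return freeip;              /* NULL if the table is full */
}

/* Each open gets its own file table entry, even for an already-open file. */
static struct toy_file *toy_open(int dev, int inum)
{
    struct toy_file *fp;

    for (fp = file_tab; fp < &file_tab[NFILE]; fp++) {
        if (fp->f_count == 0) {
            struct toy_inode *ip = toy_iget(dev, inum);
            if (ip == NULL)
                return NULL;
            fp->f_count = 1;
            fp->f_offset = 0;
            fp->f_inode = ip;
            return fp;
        }
    }
    return NULL;                /* file table full */
}
```

Opening the same (device, i-number) pair twice in this model gives two distinct toy_file entries, each with its own offset, that point at a single toy_inode whose count is 2.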

An entry in the inode table differs somewhat from the corresponding i-node as stored on 
the disk; the modified and accessed times are not stored, and the entry is augmented by a flag 
word containing information about the entry, a count used to determine when it may be 
allowed to disappear, and the device and i-number whence the entry came. Also, the several 
block numbers that give addressing information for the file are expanded from the 3-byte, 
compressed format used on the disk to full long quantities. 

During the processing of an open or creat call for a special file, the system always calls 
the device’s open routine to allow for any special processing required (rewinding a tape, turning 
on the data-terminal-ready lead of a modem, etc.). However, the close routine is called only 
when the last process closes a file, that is, when the i-node table entry is being deallocated. 
Thus it is not feasible for a device to maintain, or depend on, a count of its users, although it 
is quite possible to implement an exclusive-use device which cannot be reopened until it has 
been closed. 

When a read or write takes place, the user’s arguments and the file table entry are used to
set up the variables u.u_base, u.u_count, and u.u_offset which respectively contain the
(user) address of the I/O target area, the byte-count for the transfer, and the current location
in the file. If the file referred to is a character-type special file, the appropriate read or write 
routine is called; it is responsible for transferring data and updating the count and current 
location appropriately as discussed below. Otherwise, the current location is used to calculate a 
logical block number in the file. If the file is an ordinary file the logical block number must be 
mapped (possibly using indirect blocks) to a physical block number; a block-type special file 
need not be mapped. This mapping is performed by the bmap routine. In any event, the 
resulting physical block number is used, as discussed below, to read or write the appropriate 
device. 
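
As a small worked example of the first step, the following sketch derives a logical block number and the position within that block from a byte offset, assuming the 512-byte blocks mentioned earlier. The function names are hypothetical; the mapping from logical to physical block, which bmap performs for ordinary files, is not modelled.

```c
#define TOY_BSIZE 512L   /* block size assumed from the text */

/* Logical block number in the file that holds byte 'offset'. */
static long toy_lblkno(long offset)
{
    return offset / TOY_BSIZE;
}

/* Position of byte 'offset' within its block. */
static int toy_blkoff(long offset)
{
    return (int)(offset % TOY_BSIZE);
}
```

For instance, a current location of 1030 falls in logical block 2 at byte 6 of that block, since 1030 = 2 * 512 + 6.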

Character Device Drivers 

The cdevsw table specifies the interface routines present for character devices. Each device
provides five routines: open, close, read, write, and special-function (to implement the ioctl
system call). Any of these may be missing. If a call on the routine should be ignored (e.g.
open on non-exclusive devices that require no setup), the cdevsw entry can be given as nulldev;
if it should be considered an error (e.g. write on read-only devices), nodev is used. For terminals,
the cdevsw structure also contains a pointer to the tty structure associated with the terminal.
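
A switch table of this kind might be sketched as follows. This is an illustration under stated assumptions, not the system's cdevsw: the toy_ names, the single-argument routine signature, the error value, and the two example devices are all invented, and the stand-ins for nulldev and nodev return a status rather than recording an error in the user area.

```c
#define TOY_ENODEV 19   /* arbitrary error value chosen for the sketch */

typedef int (*toy_devfn)(int dev);

/* Stand-ins for nulldev (ignore the call) and nodev (treat it as an error). */
static int toy_nulldev(int dev) { (void)dev; return 0; }
static int toy_nodev(int dev)   { (void)dev; return TOY_ENODEV; }

/* A hypothetical write-only printer driver. */
static int toy_lp_writes;
static int toy_lp_open(int dev)  { (void)dev; return 0; }
static int toy_lp_write(int dev) { (void)dev; toy_lp_writes++; return 0; }

struct toy_cdevsw {
    toy_devfn d_open, d_close, d_read, d_write;
};

/* Indexed by major device number; no bounds check in this sketch. */
static struct toy_cdevsw toy_cdevsw[] = {
    /* 0: device needing no work at all */
    { toy_nulldev, toy_nulldev, toy_nulldev, toy_nulldev },
    /* 1: printer, which cannot be read  */
    { toy_lp_open, toy_nulldev, toy_nodev,   toy_lp_write },
};

/* Dispatch on the major number held in bits 8..15 of the device number. */
static int toy_read(int dev)  { return (*toy_cdevsw[(dev >> 8) & 0377].d_read)(dev); }
static int toy_write(int dev) { return (*toy_cdevsw[(dev >> 8) & 0377].d_write)(dev); }
```

Reading from the printer (major number 1) reaches toy_nodev and fails, while closing it reaches toy_nulldev and quietly succeeds.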

The open routine is called each time the file is opened with the full device number as 
argument. The second argument is a flag which is non-zero only if the device is to be written 
upon. 

The close routine is called only when the file is closed for the last time, that is, when the
very last process in which the file is open closes it. This means it is not possible for the driver
to maintain its own count of its users. The first argument is the device number; the second is
a flag which is non-zero if the file was open for writing in the process which performs the final
close.


When write is called, it is supplied the device as argument. The per-user variable
u.u_count has been set to the number of characters indicated by the user; for character devices,
this number may be 0 initially. u.u_base is the address supplied by the user from which
to start taking characters. The system may call the routine internally, so the flag u.u_segflg is
supplied that indicates, if on, that u.u_base refers to the system address space instead of the
user’s.

The write routine should copy up to u.u_count characters from the user’s buffer to the
device, decrementing u.u_count for each character passed. For most drivers, which work one
character at a time, the routine cpass( ) is used to pick up characters from the user’s buffer.
Successive calls on it return the characters to be written until u.u_count goes to 0 or an error
occurs, when it returns -1. Cpass takes care of interrogating u.u_segflg and updating
u.u_count.
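
The shape of such a write routine can be sketched outside the kernel. In the toy model below the user area is reduced to the two fields the text names, toy_cpass is a hypothetical stand-in for cpass that ignores u.u_segflg and address-space checks, and the "device" is merely an array.

```c
/* Reduced model of the per-user area: only the fields the text names. */
struct toy_user {
    const char *u_base;   /* where to take the next character from */
    int         u_count;  /* characters still to be transferred */
};
static struct toy_user toy_u;

/* Stand-in for cpass: next character, or -1 when u_count reaches 0. */
static int toy_cpass(void)
{
    if (toy_u.u_count == 0)
        return -1;
    toy_u.u_count--;
    return (unsigned char)*toy_u.u_base++;
}

/* The "device" is just an array in this sketch. */
static char toy_dev[64];
static int  toy_devlen;

/* Character-at-a-time write routine: drain the user's buffer via cpass.
 * Space is tested first so that no character is fetched and then lost. */
static void toy_write(void)
{
    int c;

    while (toy_devlen < (int)sizeof toy_dev && (c = toy_cpass()) >= 0)
        toy_dev[toy_devlen++] = (char)c;
}
```

After toy_write returns, toy_u.u_count holds the number of characters that were not transferred, which is 0 when the device accepted everything.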

Write routines which want to transfer a probably large number of characters into an
internal buffer may also use the routine iomove(buffer, offset, count, flag) which is faster when
many characters must be moved. Iomove transfers up to count characters into the buffer starting
offset bytes from the start of the buffer; flag should be B_WRITE (which is 0) in the write
case. Caution: the caller is responsible for making sure the count is not too large and is non-zero.
As an efficiency note, iomove is much slower if any of buffer+offset, count or u.u_base is
odd.

The device’s read routine is called under conditions similar to write, except that
u.u_count is guaranteed to be non-zero. To return characters to the user, the routine passc(c)
is available; it takes care of housekeeping like cpass and returns -1 as the last character
specified by u.u_count is returned to the user; before that time, 0 is returned. Iomove is also
usable as with write; the flag should be B_READ but the same cautions apply.

The “special-functions” routine is invoked by the stty and gtty system calls as follows: 
(*p) (dev, v) where p is a pointer to the device’s routine, dev is the device number, and v is a 
vector. In the gtty case, the device is supposed to place up to 3 words of status information 
into the vector; this will be returned to the caller. In the stty case, v is 0; the device should 
take up to 3 words of control information from the array u.u_arg[0...2].

Finally, each device should have appropriate interrupt-time routines. When an interrupt 
occurs, it is turned into a C-compatible call on the device’s interrupt routine. The
interrupt-catching mechanism makes the low-order four bits of the “new PS” word in the trap vector for
the interrupt available to the interrupt handler. This is conventionally used by drivers which 
deal with multiple similar devices to encode the minor device number. After the interrupt has 
been processed, a return from the interrupt handler will return from the interrupt itself. 

A number of subroutines are available which are useful to character device drivers. Most 
of these handlers, for example, need a place to buffer characters in the internal interface 
between their “top half” (read/write) and “bottom half” (interrupt) routines. For relatively low
data-rate devices, the best mechanism is the character queue maintained by the routines getc 
and putc. A queue header has the structure 

struct {
    int  c_cc;    /* character count */
    char *c_cf;   /* first character */
    char *c_cl;   /* last character */
} queue;

A character is placed on the end of a queue by putc(c, &queue) where c is the character and
queue is the queue header. The routine returns -1 if there is no space to put the character, 0
otherwise. The first character on the queue may be retrieved by getc(&queue) which returns
either the (non-negative) character or -1 if the queue is empty.

Notice that the space for characters in queues is shared among all devices in the system
and in the standard system there are only some 600 character slots available. Thus device 
handlers, especially write routines, must take care to avoid gobbling up excessive numbers of 
characters. 
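
The calling convention of putc and getc, including the failure return when space runs out, can be imitated with a small fixed-size queue. This is only a sketch of the interface: the real queues draw on the shared pool of character slots described above, whereas the hypothetical toy_queue below owns a private 16-byte ring, and the toy_ names are used to avoid any confusion with the Standard I/O routines of the same name.

```c
#define TOY_QSIZE 16            /* private capacity, chosen for the sketch */

struct toy_queue {
    int  c_cc;                  /* character count */
    int  head;                  /* index of the first character */
    char buf[TOY_QSIZE];
};

/* Place c on the end of the queue; -1 if there is no space, 0 otherwise. */
static int toy_putc(int c, struct toy_queue *q)
{
    if (q->c_cc >= TOY_QSIZE)
        return -1;
    q->buf[(q->head + q->c_cc) % TOY_QSIZE] = (char)c;
    q->c_cc++;
    return 0;
}

/* Remove the first character; -1 if the queue is empty. */
static int toy_getc(struct toy_queue *q)
{
    int c;

    if (q->c_cc == 0)
        return -1;
    c = (unsigned char)q->buf[q->head];
    q->head = (q->head + 1) % TOY_QSIZE;
    q->c_cc--;
    return c;
}
```

A write routine built on this would stop queueing, and typically go to sleep, as soon as toy_putc returns -1 rather than discarding the character.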

The other major help available to device handlers is the sleep-wakeup mechanism. The 
call sleep(event, priority) causes the process to wait (allowing other processes to run) until the 
event occurs; at that time, the process is marked ready-to-run and the call will return when 
there is no process with higher priority. 

The call wakeup(event) indicates that the event has happened, that is, causes processes 
sleeping on the event to be awakened. The event is an arbitrary quantity agreed upon by the 
sleeper and the waker-up. By convention, it is the address of some data area used by the 
driver, which guarantees that events are unique. 

Processes sleeping on an event should not assume that the event has really happened; 
they should check that the conditions which caused them to sleep no longer hold. 
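
In code, this advice amounts to calling sleep inside a loop that re-tests the condition. The sketch below runs in a single thread with a stub in place of sleep, so it shows only the control flow; toy_sleep, toy_busy, and the rule that the condition clears on the third wakeup are all inventions of the example.

```c
static int toy_busy = 1;     /* the condition that makes the caller wait */
static int toy_wakeups;      /* how many times the stub "woke" the caller */

/* Stub standing in for sleep(event, priority). It returns at once, as if
 * the process had been awakened, and clears the condition only on the
 * third call, so the first two returns model wakeups that arrive while
 * the condition still holds. */
static void toy_sleep(void *event, int priority)
{
    (void)event;
    (void)priority;
    if (++toy_wakeups >= 3)
        toy_busy = 0;
}

/* Wait until the resource is idle, re-testing after every wakeup. */
static void toy_wait_until_idle(void)
{
    while (toy_busy)
        toy_sleep(&toy_busy, 10);
}
```

Had the loop been written as a single if, the caller would have proceeded after the first wakeup with toy_busy still set.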

Priorities can range from 0 to 127; a higher numerical value indicates a less-favored 
scheduling situation. A distinction is made between processes sleeping at priority less than the
parameter PZERO and those at numerically larger priorities. The former cannot be interrupted
by signals, although it is conceivable that they may be swapped out. Thus it is a bad idea
to sleep with priority less than PZERO on an event which might never occur. On the other 
hand, calls to sleep with larger priority may never return if the process is terminated by some 
signal in the meantime. Incidentally, it is a gross error to call sleep in a routine called at inter¬ 
rupt time, since the process which is running is almost certainly not the process which should 
go to sleep. Likewise, none of the variables in the user area “u.” should be touched, let alone 
changed, by an interrupt routine. 

If a device driver wishes to wait for some event for which it is inconvenient or impossible 
to supply a wakeup, (for example, a device going on-line, which does not generally cause an 
interrupt), the call sleep(&lbolt, priority) may be given. Lbolt is an external cell whose address 
is awakened once every 4 seconds by the clock interrupt routine. 

The routines spl4( ), spl5( ), spl6( ), spl7( ) are available to set the processor priority level 
as indicated to avoid inconvenient interrupts from the device. 

If a device needs to know about real-time intervals, then timeout(func, arg, interval) will 
be useful. This routine arranges that after interval sixtieths of a second, the func will be called 
with arg as argument, in the style (*func)(arg). Timeouts are used, for example, to provide 
real-time delays after function characters like new-line and tab in typewriter output, and to 
terminate an attempt to read the 201 Dataphone dp if there is no response within a specified 
number of seconds. Notice that the number of sixtieths of a second is limited to 32767, since 
it must appear to be positive, and that only a bounded number of timeouts can be going on at 
once. Also, the specified func is called at clock-interrupt time, so it should conform to the 
requirements of interrupt routines in general. 

The Block-device Interface 

Handling of block devices is mediated by a collection of routines that manage a set of 
buffers containing the images of blocks of data on the various devices. The most important 
purpose of these routines is to assure that several processes that access the same block of the 
same device in multiprogrammed fashion maintain a consistent view of the data in the block. 
A secondary but still important purpose is to increase the efficiency of the system by keeping 
in-core copies of blocks that are being accessed frequently. The main data base for this 
mechanism is the table of buffers buf. Each buffer header contains a pair of pointers (b_forw,
b_back) which maintain a doubly-linked list of the buffers associated with a particular block
device, and a pair of pointers (av_forw, av_back) which generally maintain a doubly-linked
list of blocks which are “free,” that is, eligible to be reallocated for another transaction.
Buffers that have I/O in progress or are busy for other purposes do not appear in this list.
The buffer header also contains the device and block number to which the buffer refers, and a
pointer to the actual storage associated with the buffer. There is a word count which is the 
negative of the number of words to be transferred to or from the buffer; there is also an error 
byte and a residual word count used to communicate information from an I/O routine to its 
caller. Finally, there is a flag word with bits indicating the status of the buffer. These flags 
will be discussed below. 

Seven routines constitute the most important part of the interface with the rest of the 
system. Given a device and block number, both bread and getblk return a pointer to a buffer 
header for the block; the difference is that bread is guaranteed to return a buffer actually containing
the current data for the block, while getblk returns a buffer which contains the data in
the block only if it is already in core (whether it is or not is indicated by the B_DONE bit; see
below). In either case the buffer, and the corresponding device block, is made “busy,” so that
other processes referring to it are obliged to wait until it becomes free. Getblk is used, for 
example, when a block is about to be totally rewritten, so that its previous contents are not 
useful; still, no other process can be allowed to refer to the block until the new data is placed 
into it. 

The breada routine is used to implement read-ahead; it is logically similar to bread, but
takes as an additional argument the number of a block (on the same device) to be read asyn¬ 
chronously after the specifically requested block is available. 

Given a pointer to a buffer, the brelse routine makes the buffer again available to other 
processes. It is called, for example, after data has been extracted following a bread. There are 
three subtly-different write routines, all of which take a buffer pointer as argument, and all of 
which logically release the buffer for use by others and place it on the free list. Bwrite puts the 
buffer on the appropriate device queue, waits for the write to be done, and sets the user’s error 
flag if required. Bawrite places the buffer on the device’s queue, but does not wait for comple¬ 
tion, so that errors cannot be reflected directly to the user. Bdwrite does not start any I/O 
operation at all, but merely marks the buffer so that if it happens to be grabbed from the free 
list to contain data from some other block, the data in it will first be written out. 

Bwrite is used when one wants to be sure that I/O takes place correctly, and that errors 
are reflected to the proper user; it is used, for example, when updating i-nodes. Bawrite is use¬ 
ful when more overlap is desired (because no wait is required for I/O to finish) but when it is 
reasonably certain that the write is really required. Bdwrite is used when there is doubt that 
the write is needed at the moment. For example, bdwrite is called when the last byte of a write 
system call falls short of the end of a block, on the assumption that another write will be given 
soon which will re-use the same block. On the other hand, as the end of a block is passed, 
bawrite is called, since probably the block will not be accessed again soon and one might as 
well start the writing process as soon as possible. 
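
The rule of thumb in the example above can be written as a one-line decision. The sketch assumes 512-byte blocks and reduces the choice to the position at which a transfer into a block ends; the enumeration and function name are hypothetical, and the real write path takes other circumstances into account.

```c
enum toy_wr { TOY_BDWRITE, TOY_BAWRITE };

/* Decide how to release a block after data ending at byte 'end_offset'
 * of the file has been copied into it. */
static enum toy_wr toy_choose_write(long end_offset)
{
    /* The end of the block was reached: start the I/O now, without waiting. */
    if (end_offset % 512 == 0)
        return TOY_BAWRITE;
    /* The block is only partly filled: delay, expecting a further write. */
    return TOY_BDWRITE;
}
```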

In any event, notice that the routines getblk and bread dedicate the given block 
exclusively to the use of the caller, and make others wait, while one of brelse, bwrite, bawrite, 
or bdwrite must eventually be called to free the block for use by others. 

As mentioned, each buffer header contains a flag word which indicates the status of the 
buffer. Since they provide one important channel for information between the drivers and the 
block I/O system, it is important to understand these flags. The following names are manifest 
constants which select the associated flag bits. 

B_READ This bit is set when the buffer is handed to the device strategy routine (see below)
to indicate a read operation. The symbol B_WRITE is defined as 0 and does not
define a flag; it is provided as a mnemonic convenience to callers of routines like
swap which have a separate argument which indicates read or write.

B_DONE This bit is set to 0 when a block is handed to the device strategy routine and is
turned on when the operation completes, whether normally or as the result of an error.
It is also used as part of the return argument of getblk to indicate if 1 that the
returned buffer actually contains the data in the requested block.

B_ERROR This bit may be set to 1 when B_DONE is set to indicate that an I/O or other
error occurred. If it is set the b_error byte of the buffer header may contain an
error code if it is non-zero. If b_error is 0 the nature of the error is not specified. 
Actually no driver at present sets b_error; the latter is provided for a future 
improvement whereby a more detailed error-reporting scheme may be implemented. 

B_BUSY This bit indicates that the buffer header is not on the free list, i.e. is dedicated to
someone’s exclusive use. The buffer still remains attached to the list of blocks asso¬ 
ciated with its device, however. When getblk (or bread, which calls it) searches the 
buffer list for a given device and finds the requested block with this bit on, it sleeps 
until the bit clears. 

B_PHYS This bit is set for raw I/O transactions that need to allocate the Unibus map on an
11/70.

B_MAP This bit is set on buffers that have the Unibus map allocated, so that the iodone
routine knows to deallocate the map. 

B_WANTED

This flag is used in conjunction with the B_BUSY bit. Before sleeping as
described just above, getblk sets this flag. Conversely, when the block is freed and
the busy bit goes down (in brelse) a wakeup is given for the block header whenever
B_WANTED is on. This stratagem avoids the overhead of having to call wakeup
every time a buffer is freed on the chance that someone might want it. 

B_AGE This bit may be set on buffers just before releasing them; if it is on, the buffer is
placed at the head of the free list, rather than at the tail. It is a performance 
heuristic used when the caller judges that the same block will not soon be used 
again. 

B_ASYNC This bit is set by bawrite to indicate to the appropriate device driver that the
buffer should be released when the write has been finished, usually at interrupt
time. The difference between bwrite and bawrite is that the former starts I/O, waits
until it is done, and frees the buffer. The latter merely sets this bit and starts I/O.
The bit indicates that brelse should be called for the buffer on completion.

B_DELWRI 

This bit is set by bdwrite before releasing the buffer. When getblk, while searching 
for a free block, discovers the bit is 1 in a buffer it would otherwise grab, it causes 
the block to be written out before reusing it. 

Block Device Drivers 

The bdevsw table contains the names of the interface routines and that of a table for 
each block device. 

Just as for character devices, block device drivers may supply an open and a close routine 
called respectively on each open and on the final close of the device. Instead of separate read 
and write routines, each block device driver has a strategy routine which is called with a 
pointer to a buffer header as argument. As discussed, the buffer header contains a read/write 
flag, the core address, the block number, a (negative) word count, and the major and minor 
device number. The role of the strategy routine is to carry out the operation as requested by 
the information in the buffer header. When the transaction is complete the B_DONE (and
possibly the B_ERROR) bits should be set. Then if the B_ASYNC bit is set, brelse should be
called; otherwise, wakeup. In cases where the device is capable, under error-free operation, of
transferring fewer words than requested, the device’s word-count register should be placed in 
the residual count slot of the buffer header; otherwise, the residual count should be set to 0. 
This particular mechanism is really for the benefit of the magtape driver; when reading this
device, records shorter than requested are quite normal, and the user should be told the actual
length of the record.

Although the most usual argument to the strategy routines is a genuine buffer header
allocated as discussed above, all that is actually required is that the argument be a pointer to a 
place containing the appropriate information. For example the swap routine, which manages 
movement of core images to and from the swapping device, uses the strategy routine for this 
device. Care has to be taken that no extraneous bits get turned on in the flag word. 

The device’s table specified by bdevsw has a byte to contain an active flag and an error
count, a pair of links which constitute the head of the chain of buffers for the device (b_forw,
b_back), and a first and last pointer for a device queue. Of these things, all are used solely by
the device driver itself except for the buffer-chain pointers. Typically the flag encodes the 
state of the device, and is used at a minimum to indicate that the device is currently engaged 
in transferring information and no new command should be issued. The error count is useful 
for counting retries when errors occur. The device queue is used to remember stacked 
requests; in the simplest case it may be maintained as a first-in first-out list. Since buffers 
which have been handed over to the strategy routines are never on the list of free buffers, the 
pointers in the buffer which maintain the free list (av_forw, av_back) are also used to contain 
the pointers which maintain the device queues. 
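
In the simplest first-in first-out case, the device queue can be kept as a singly-linked list threaded through av_forw, as in the sketch below. The structures are cut down to the fields involved, and the names d_actf and d_actl for the first and last pointers are assumptions of the example; the text does not name them.

```c
#include <stddef.h>

struct toy_buf {
    struct toy_buf *av_forw;   /* free-list link, reused as the queue link */
    int b_blkno;               /* block number of the request */
};

struct toy_devtab {
    struct toy_buf *d_actf;    /* first request in the device queue */
    struct toy_buf *d_actl;    /* last request in the device queue */
};

/* Append a request to the tail of the device queue. */
static void toy_enqueue(struct toy_devtab *dp, struct toy_buf *bp)
{
    bp->av_forw = NULL;
    if (dp->d_actf == NULL)
        dp->d_actf = bp;
    else
        dp->d_actl->av_forw = bp;
    dp->d_actl = bp;
}

/* Remove and return the request at the head, or NULL if none is queued. */
static struct toy_buf *toy_dequeue(struct toy_devtab *dp)
{
    struct toy_buf *bp = dp->d_actf;

    if (bp != NULL) {
        dp->d_actf = bp->av_forw;
        if (dp->d_actf == NULL)
            dp->d_actl = NULL;
    }
    return bp;
}
```

A strategy routine would call something like toy_enqueue and then start the device if it is idle; the interrupt routine would take the next request with toy_dequeue.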

A couple of routines are provided which are useful to block device drivers. iodone(bp)
arranges that the buffer to which bp points be released or awakened, as appropriate, when the
strategy module has finished with the buffer, either normally or after an error. (In the latter
case the B_ERROR bit has presumably been set.)
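
The "released or awakened, as appropriate" decision might look roughly like this. The sketch is not the system's iodone: the flag values are arbitrary, the buffer header holds only a flag word, and toy_brelse and toy_wakeup are counting stubs that stand in for brelse and wakeup.

```c
/* Flag values are arbitrary choices for this sketch. */
#define TOY_B_DONE   02
#define TOY_B_ERROR  04
#define TOY_B_BUSY   010
#define TOY_B_ASYNC  0400

struct toy_buf {
    int b_flags;
};

static int toy_released;   /* calls made to the brelse stand-in */
static int toy_woken;      /* calls made to the wakeup stand-in */

/* Stand-in for brelse: the buffer stops being busy and rejoins the free list. */
static void toy_brelse(struct toy_buf *bp)
{
    bp->b_flags &= ~(TOY_B_BUSY | TOY_B_ASYNC);
    toy_released++;
}

/* Stand-in for wakeup on the buffer header. */
static void toy_wakeup(struct toy_buf *bp)
{
    (void)bp;
    toy_woken++;
}

/* Mark the transaction complete. Nobody waits for an asynchronous buffer,
 * so it is released here; otherwise the waiting process is awakened and
 * releases the buffer itself later. */
static void toy_iodone(struct toy_buf *bp)
{
    bp->b_flags |= TOY_B_DONE;
    if (bp->b_flags & TOY_B_ASYNC)
        toy_brelse(bp);
    else
        toy_wakeup(bp);
}
```

A driver that detected an error would set the error bit in b_flags before making this call.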

The routine geterror(bp) can be used to examine the error bit in a buffer header and 
arrange that any error indication found therein is reflected to the user. It may be called only 
in the non-interrupt part of a driver when I/O has completed (B_DONE has been set).

Raw Block-device I/O 

A scheme has been set up whereby block device drivers may provide the ability to 
transfer information directly between the user’s core image and the device without the use of 
buffers and in blocks as large as the caller requests. The method involves setting up a 
character-type special file corresponding to the raw device and providing read and write rou¬ 
tines which set up what is usually a private, non-shared buffer header with the appropriate 
information and call the device’s strategy routine. If desired, separate open and close routines 
may be provided but this is usually unnecessary. A special-function routine might come in 
handy, especially for magtape. 

A great deal of work has to be done to generate the “appropriate information” to put in 
the argument buffer for the strategy module; the worst part is to map relocated user addresses 
to physical addresses. Most of this work is done by physio(strat, bp, dev, rw) whose arguments
are the name of the strategy routine strat, the buffer pointer bp, the device number dev, and a
read-write flag rw whose value is either B_READ or B_WRITE. Physio makes sure that the
user’s base address and count are even (because most devices work in words) and that the core
area affected is contiguous in physical space; it delays until the buffer is not busy, and makes it 
busy while the operation is in progress; and it sets up user error return information.

ABSTRACT


This document describes a package of C library functions which allow the user to: 

• update a screen with reasonable optimization, 

• get input from the terminal in a screen-oriented fashion, and 

• independent from the above, move the cursor optimally from one point to another. 

These routines all use the /etc/termcap database to describe the capabilities of the
terminal.

1. Overview

In making available the generalized terminal descriptions in /etc/termcap, much infor¬ 
mation was made available to the programmer, but little work was taken out of one’s hands. 
The purpose of this package is to allow the C programmer to do the most common type of ter¬ 
minal dependent functions, those of movement optimization and optimal screen updating, 
without doing any of the dirty work, and (hopefully) with nearly as much ease as is necessary 
to simply print or read things. 

The package is split into three parts: (1) Screen updating; (2) Screen updating with user 
input; and (3) Cursor motion optimization. 

It is possible to use the motion optimization without using either of the other two, and 
screen updating and input can be done without any programmer knowledge of the motion op¬ 
timization, or indeed the database itself. 

1.1. Terminology (or, Words You Can Say to Sound Brilliant) 

In this document, the following terminology is kept to with reasonable consistency: 

window: An internal representation containing an image of what a section of the terminal 
screen may look like at some point in time. This subsection can either encompass the 
entire terminal screen, or any smaller portion down to a single character within that 
screen. 

terminal: Sometimes called terminal screen. The package’s idea of what the terminal’s screen
currently looks like, i.e., what the user sees now. This is a special screen: 

screen: This is a subset of windows which are as large as the terminal screen, i.e., they start at 
the upper left hand corner and encompass the lower right hand corner. One of these, 
stdscr, is automatically provided for the programmer.

1.2. Compiling Things 

In order to use the library, it is necessary to have certain types and variables defined. 
Therefore, the programmer must have a line: 

#include <curses.h> 

at the top of the program source. The header file <curses.h> needs to include <sgtty.h>, so
one should not do so oneself¹. Also, compilations should have the following form:

cc [ flags ] file ... -lcurses -ltermlib

1.3. Screen Updating 

In order to update the screen optimally, it is necessary for the routines to know what the 
screen currently looks like and what the programmer wants it to look like next. For this pur¬ 
pose, a data type (structure) named WINDOW is defined which describes a window image to 
the routines, including its starting position on the screen (the (y, x) co-ordinates of the upper 
left hand corner) and its size. One of these (called curscr for current screen ) is a screen image 
of what the terminal currently looks like. Another screen (called stdscr, for standard screen) is
provided by default to make changes on. 

A window is a purely internal representation. It is used to build and store a potential im¬ 
age of a portion of the terminal. It doesn’t bear any necessary relation to what is really on the 
terminal screen. It is more like an array of characters on which to make changes. 

1 The screen package also uses the Standard I/O library, so <curses.h> includes <stdio.h>. It is redundant

When one has a window which describes what some part of the terminal should look like,
the routine refresh() (or wrefresh() if the window is not stdscr) is called. refresh() makes the
terminal, in the area covered by the window, look like that window. Note, therefore, that
changing something on a window does not change the terminal. Actual updates to the terminal
screen are made only by calling refresh() or wrefresh(). This allows the programmer to maintain
several different ideas of what a portion of the terminal screen should look like. Also,
changes can be made to windows in any order, without regard to motion efficiency. Then, at 
will, the programmer can effectively say “make it look like this,” and let the package worry 
about the best way to do this. 
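
The separation between a window and the terminal can be demonstrated with a toy model that does not use the library at all. Everything here is hypothetical: the 4 by 8 size, the toy_ names, and above all toy_refresh, which simply copies the whole window, whereas the package's refresh() works out an economical way to bring the terminal up to date.

```c
#include <string.h>

#define TOY_ROWS 4
#define TOY_COLS 8

/* A window: a purely internal array of characters. */
struct toy_window {
    char cells[TOY_ROWS][TOY_COLS];
};

/* What the user sees; only toy_refresh changes it. */
static char toy_terminal[TOY_ROWS][TOY_COLS];

/* Fill a window with blanks. */
static void toy_clear(struct toy_window *w)
{
    memset(w->cells, ' ', sizeof w->cells);
}

/* Change the window only; the terminal is not touched. */
static void toy_put(struct toy_window *w, int y, int x, char ch)
{
    w->cells[y][x] = ch;
}

/* Make the terminal look like the window. */
static void toy_refresh(const struct toy_window *w)
{
    memcpy(toy_terminal, w->cells, sizeof toy_terminal);
}
```

A character placed with toy_put appears in toy_terminal only after the next toy_refresh, so any number of changes can be made, in any order, between two refreshes.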

1.4. Naming Conventions 

As hinted above, the routines can use several windows, but two are automatically given:
curscr, which knows what the terminal looks like, and stdscr, which is what the programmer
wants the terminal to look like next. The user should never really access curscr directly.
Changes should be made to the appropriate screen, and then the routine refresh() (or
wrefresh()) should be called.

Many functions are set up to deal with stdscr as a default screen. For example, to add a 
character to stdscr, one calls addch() with the desired character. If a different window is to be 
used, the routine waddch() (for window-specific addch()) is provided 2 . This convention of 
prepending function names with a “w” when they are to be applied to specific windows is 
consistent. The only routines which do not do this are those to which a window must always be 
specified. 

In order to move the current (y, x) co-ordinates from one point to another, the routines 
move() and wmove() are provided. However, it is often desirable to first move and then 
perform some I/O operation. In order to avoid clumsiness, most I/O routines can be preceded by 
the prefix “mv” and the desired (y, x) co-ordinates then can be added to the arguments to the 
function. For example, the calls 

move(y, x); 
addch(ch); 

can be replaced by 

mvaddch(y, x, ch); 

and 

wmove(win, y, x); 
waddch(win, ch); 

can be replaced by 

mvwaddch(win, y, x, ch); 

Note that the window description pointer (win) comes before the added (y, x) co-ordinates. If 
such pointers are needed, they are always the first parameters passed. 
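
One way such a combined “mv” routine could be built is as a macro that moves first and adds the character only if the move succeeded. The sketch below uses a toy window type so that it stands alone; toy_win, toy_wmove(), toy_waddch(), toy_mvwaddch() and the TOY_ constants are hypothetical names, and nothing here is taken from the library's actual header.

```c
#include <assert.h>

#define TOY_OK    1             /* hypothetical status values */
#define TOY_ERR   0
#define TOY_LINES 4
#define TOY_COLS  8

struct toy_win {
    int  cury, curx;            /* current (y, x) co-ordinates */
    char cell[TOY_LINES][TOY_COLS];
};

/* Set the current co-ordinates; refuse positions outside the window. */
int toy_wmove(struct toy_win *win, int y, int x)
{
    if (y < 0 || y >= TOY_LINES || x < 0 || x >= TOY_COLS)
        return TOY_ERR;
    win->cury = y;
    win->curx = x;
    return TOY_OK;
}

/* Store ch at the current co-ordinates and advance x by one. */
int toy_waddch(struct toy_win *win, char ch)
{
    assert(win->cury >= 0 && win->cury < TOY_LINES && win->curx >= 0);
    if (win->curx >= TOY_COLS)
        return TOY_ERR;
    win->cell[win->cury][win->curx++] = ch;
    return TOY_OK;
}

/* Window pointer first, then (y, x), then the routine's own arguments. */
#define toy_mvwaddch(win, y, x, ch) \
    (toy_wmove((win), (y), (x)) == TOY_ERR ? TOY_ERR : toy_waddch((win), (ch)))
```

A failed move leaves the window and its current co-ordinates untouched, so the caller need only test the one return value.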


(but harmless) for the programmer to do it, too. 

2 Actually, addch() is really a “#define” macro with arguments, as are most of the "functions" which deal with 
stdscr as a default. 
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2. Variables 

Many variables which are used to describe the terminal environment are available to the 
programmer. They are: 

type        name        description 

WINDOW *    curscr      current version of the screen (terminal screen). 
WINDOW *    stdscr      standard screen. Most updates are usually done here. 
char *      Def_term    default terminal type if type cannot be determined 
bool        My_term     use the terminal specification in Def_term as terminal, 
                        irrelevant of real terminal type 
char *      ttytype     full name of the current terminal. 
int         LINES       number of lines on the terminal 
int         COLS        number of columns on the terminal 
int         ERR         error flag returned by routines on a fail. 
int         OK          error flag returned by routines when things go right. 

There are also several “#define” constants and types which are of general usefulness: 

reg storage class “register” (e.g., reg int i;) 

bool boolean type, actually a “char” (e.g., bool doneit;) 

TRUE boolean “true” flag (1). 

FALSE boolean “false” flag (0). 

3. Usage 

This is a description of how to actually use the screen package. In it, we assume all 
updating, reading, etc. is applied to stdscr. All instructions will work on any window, provided 
the function name and parameters are changed as mentioned above. 

3.1. Starting up 

In order to use the screen package, the routines must know about terminal 
characteristics, and the space for curscr and stdscr must be allocated. These functions are performed by 
initscr(). Since it must allocate space for the windows, it can overflow core when attempting to 
do so. On this rather rare occasion, initscr() returns ERR. initscr() must always be called 
before any of the routines which affect windows are used. If it is not, the program will core 
dump as soon as either curscr or stdscr is referenced. However, it is usually best to wait to 
call it until after you are sure you will need it, like after checking for startup errors. Terminal 
status changing routines like nl() and crmode() should be called after initscr(). 

Now that the screen windows have been allocated, you can set them up for the run. If 
you want to, say, allow the window to scroll, use scrollok(). If you want the cursor to be left 
after the last change, use leaveok(). If this isn’t done, refresh() will move the cursor to the 
window’s current (y, x) co-ordinates after updating it. New windows of your own can be 
created, too, by using the functions newwin() and subwin(). delwin() will allow you to get rid of old 
windows. If you wish to change the official size of the terminal by hand, just set the variables 
LINES and COLS to be what you want, and then call initscr(). This is best done before, but 
can be done either before or after, the first call to initscr(), as it will always delete any existing 
stdscr and/or curscr before creating new ones. 

3.2. The Nitty-Gritty 
3.2.1. Output 

Now that we have set things up, we will want to actually update the terminal. The basic 
functions used to change what will go on a window are addch() and move(). addch() adds a 
character at the current (y, x) co-ordinates, returning ERR if it would cause the window to 
illegally scroll, i.e., printing a character in the lower right-hand corner of a terminal which 
automatically scrolls if scrolling is not allowed. move() changes the current (y, x) co-ordinates to 
whatever you want them to be. It returns ERR if you try to move off the window when 
scrolling is not allowed. As mentioned above, you can combine the two into mvaddch() to do both 
things in one fell swoop. 

The other output functions, such as addstr() and printw(), all call addch() to add 
characters to the window. 


After you have put on the window what you want there, when you want the portion of 
the terminal covered by the window to be made to look like it, you must call refresh(). In order 
to optimize finding changes, refresh() assumes that any part of the window not changed since 
the last refresh() of that window has not been changed on the terminal, i.e., that you have not 
refreshed a portion of the terminal with an overlapping window. If this is not the case, the 
routine touchwin() is provided to make it look like the entire window has been changed, thus 
making refresh() check the whole subsection of the terminal for changes. 

If you call wrefresh() with curscr, it will make the screen look like curscr thinks it looks 
like. This is useful for implementing a command which would redraw the screen in case it gets 
messed up. 


3.2.2. Input 

Input is essentially a mirror image of output. The complementary function to addch() is 
getch() which, if echo is set, will call addch() to echo the character. Since the screen package 
needs to know what is on the terminal at all times, if characters are to be echoed, the tty must 
be in raw or cbreak mode. If it is not, getch() sets it to be cbreak, and then reads in the 
character. 

3.2.3. Miscellaneous 

All sorts of fun functions exist for maintaining and changing information about the 
windows. For the most part, the descriptions in section 5.4 should suffice. 


3.3. Finishing up 

In order to do certain optimizations, and, on some terminals, to work at all, some things 
must be done before the screen routines start up. These functions are performed in 
gettmode() and setterm(), which are called by initscr(). In order to clean up after the routines, 
the routine endwin() is provided. It restores tty modes to what they were when initscr() was 
first called. Thus, anytime after the call to initscr(), endwin() should be called before exiting. 

4. Cursor Motion Optimization: Standing Alone 

It is possible to use the cursor optimization functions of this screen package without the 
overhead and additional size of the screen updating functions. The screen updating functions 
are designed for uses where parts of the screen are changed, but the overall image remains the 
same. This includes such programs as eye and vi 3 . Certain other programs will find it 
difficult to use these functions in this manner without considerable unnecessary program 
overhead. For such applications, such as some “crt hacks” 4 and optimizing cat(1)-type programs, 
all that is needed is the motion optimizations. This, therefore, is a description of some of 
what goes on at the lower levels of this screen package. The descriptions assume a certain 
amount of familiarity with programming problems and some finer points of C. None of it is 
terribly difficult, but you should be forewarned. 


3 Eye actually uses these functions; vi does not. 

4 Graphics programs designed to run on character-oriented terminals. I could name many, but they come and go, 
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4.1. Terminal Information 

In order to use a terminal’s features to the best of a program’s abilities, it must first know 
what they are 5 . The /etc/termcap database describes these, but a certain amount of decoding 
is necessary, and there are, of course, both efficient and inefficient ways of reading them in. 
The algorithm that the package uses is taken from vi and is hideously efficient. It reads them in a 
tight loop into a set of variables whose names are two uppercase letters with some mnemonic 
value. For example, HO is a string which moves the cursor to the “home” position 6 . As there 
are two types of variables involving ttys, there are two routines. The first, gettmode(), sets 
some variables based upon the tty modes accessed by gtty(2) and stty(2). The second, 
setterm(), does a larger task by reading in the descriptions from the /etc/termcap database. This is 
the way these routines are used by initscr(): 

if (isatty(0)) { 
        gettmode(); 
        if (sp = getenv("TERM")) 
                setterm(sp); 
} 
else 
        setterm(Def_term); 
_puts(TI); 
_puts(VS); 

isatty() checks to see if file descriptor 0 is a terminal 7 . If it is, gettmode() sets the 
terminal description modes from a gtty(2). getenv() is then called to get the name of the terminal, 
and that value (if there is one) is passed to setterm(), which reads in the variables from 
/etc/termcap associated with that terminal. (getenv() returns a pointer to a string containing 
the name of the terminal, which we save in the character pointer sp.) If isatty() returns false, 
the default terminal Def_term is used. The TI and VS sequences initialize the terminal 
(_puts() is a macro which uses tputs() (see termcap(3)) to put out a string). It is these things 
which endwin() undoes. 

4.2. Movement Optimizations, or, Getting Over Yonder 

Now that we have all this useful information, it would be nice to do something with it 8 . 
The most difficult thing to do properly is motion optimization. When you consider how many 
different features various terminals have (tabs, backtabs, non-destructive space, home 
sequences, absolute tabs, ...) you can see that deciding how to get from here to there can be a 
decidedly non-trivial task. The editor vi uses many of these features, and the routines it uses 
to do this take up many pages of code. Fortunately, I was able to liberate them with the 
author’s permission, and use them here. 

After using gettmode() and setterm() to get the terminal descriptions, the function 
mvcur() deals with this task. Its usage is simple: you simply tell it where you are now and 
where you want to go. For example 


so the list would be quickly out of date. Recently, there have been programs such as rocket and gun. 

5 If this comes as any surprise to you, there’s this tower in Paris they’re thinking of junking that I can let you 
have for a song. 

6 These names are identical to those variables used in the /etc/termcap database to describe each capability. 
See Appendix A for a complete list of those read, and termcap(5) for a full description. 

7 isatty() is defined in the default C library function routines. It does a gtty(2) on the descriptor and checks the 
return value. 

8 Actually, it can be emotionally fulfilling just to get the information. This is usually only true, however, if you 
have the social life of a kumquat. 
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mvcur(0, 0, LINES/2, COLS/2) 

would move the cursor from the home position (0, 0) to the middle of the screen. If you wish 
to force absolute addressing, you can use the function tgoto() from the termlib(7) routines, or 
you can tell mvcur() that you are impossibly far away, like Cleveland. For example, to abso¬ 
lutely address the lower left hand corner of the screen from anywhere just claim that you are in 
the upper right hand corner: 

mvcur(0, COLS-1, LINES-1, 0) 


5. The Functions 

In the following definitions, “t” means that the “function” is really a “#define” macro 
with arguments. This means that it will not show up in stack traces in the debugger, or, in the 
case of such functions as addch(), it will show up as its “w” counterpart. The arguments are 
given to show the order and type of each. Their names are not mandatory, just suggestive. 

5.1. Output Functions 


addch(ch) t 

char ch; 

waddch(win, ch) 

WINDOW *win; 
char ch; 

Add the character ch on the window at the current (y, x) co-ordinates. If the character is 
a newline ('\n') the line will be cleared to the end, and the current (y, x) co-ordinates will 
be changed to the beginning of the next line if newline mapping is on, or to the next line 
at the same x co-ordinate if it is off. A return ('\r') will move to the beginning of the line 
on the window. Tabs ('\t') will be expanded into spaces in the normal tabstop positions 
of every eight characters. This returns ERR if it would cause the screen to scroll illegally. 
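
The tab expansion described above comes down to a little arithmetic on the x co-ordinate. The helper below is a hypothetical illustration (next_tabstop() and TAB_WIDTH are not names from the library); it assumes tab stops at columns 0, 8, 16, and so on, as the text states.

```c
#include <assert.h>

#define TAB_WIDTH 8             /* tab stops every eight columns */

/*
 * Return the column a tab would move to from column x.
 * The number of spaces to add is next_tabstop(x) - x.
 */
int next_tabstop(int x)
{
    assert(x >= 0);
    return x + (TAB_WIDTH - x % TAB_WIDTH);
}
```

A tab at a column that is already a tab stop advances a full eight columns, which matches the usual terminal behavior.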


addstr(str) t 

char *str; 

waddstr(win, str) 

WINDOW *win; 
char *str; 

Add the string pointed to by str on the window at the current (y, x) co-ordinates. This 
returns ERR if it would cause the screen to scroll illegally. In this case, it will put on as 
much as it can. 


box(win, vert, hor) 

WINDOW *win; 
char vert, hor; 

Draws a box around the window using vert as the character for drawing the vertical sides, 
and hor for drawing the horizontal lines. If scrolling is not allowed, and the window 
encompasses the lower right-hand corner of the terminal, the corners are left blank to avoid 
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a scroll. 


clear() t 

wclear(win) 

WINDOW *win; 

Resets the entire window to blanks. If win is a screen, this sets the clear flag, which will 
cause a clear-screen sequence to be sent on the next refresh() call. This also moves the 
current (y, x) co-ordinates to (0, 0). 


clearok(scr, boolf) t 

WINDOW *scr; 
bool boolf; 

Sets the clear flag for the screen scr. If boolf is TRUE, this will force a clear-screen to be 
printed on the next refresh(), or stop it from doing so if boolf is FALSE. This only works 
on screens, and, unlike clear(), does not alter the contents of the screen. If scr is curscr, 
the next refresh() call will cause a clear-screen, even if the window passed to refresh() is 
not a screen. 


clrtobot() t 


wclrtobot(win) 

WINDOW *win; 

Wipes the window clear from the current (y, x) co-ordinates to the bottom. This does 
not force a clear-screen sequence on the next refresh under any circumstances. This has 
no associated “mv” command. 


clrtoeol() t 

wclrtoeol(win) 

WINDOW *win; 

Wipes the window clear from the current (y, x) co-ordinates to the end of the line. This 
has no associated “mv” command. 


delch() 

wdelch(win) 

WINDOW *win; 

Delete the character at the current (y, x) co-ordinates. Each character after it on the line 
shifts to the left, and the last character becomes blank. 
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deleteln() 

wdeleteln(win) 

WINDOW *win; 

Delete the current line. Every line below the current one will move up, and the bottom 
line will become blank. The current (y, x) co-ordinates will remain unchanged. 


erase() t 

werase(win) 

WINDOW *win; 

Erases the window to blanks without setting the clear flag. This is analogous to clear(), 
except that it never causes a clear-screen sequence to be generated on a refresh(). This 
has no associated “mv” command. 


insch(c) 

char c; 

winsch(win, c) 

WINDOW *win; 
char c; 

Insert c at the current (y, x) co-ordinates. Each character after it shifts to the right, and 
the last character disappears. This returns ERR if it would cause the screen to scroll 
illegally. 


insertln() 

winsertln(win) 

WINDOW *win; 

Insert a line above the current one. Every line below the current line will be shifted 
down, and the bottom line will disappear. The current line will become blank, and the 
current (y, x) co-ordinates will remain unchanged. This returns ERR if it would cause 
the screen to scroll illegally. 


move(y, x) t 

int y, x; 

wmove(win, y, x) 

WINDOW *win; 
int y, x; 

Change the current (y, x) co-ordinates of the window to (y, x). This returns ERR if it 
would cause the screen to scroll illegally. 


overlay(win1, win2) 

WINDOW *win1, *win2; 

Overlay win1 on win2. The contents of win1, insofar as they fit, are placed on win2 at 
their starting (y, x) co-ordinates. This is done non-destructively, i.e., blanks on win1 
leave the contents of the space on win2 untouched. 


overwrite(win1, win2) 

WINDOW *win1, *win2; 

Overwrite win1 on win2. The contents of win1, insofar as they fit, are placed on win2 at 
their starting (y, x) co-ordinates. This is done destructively, i.e., blanks on win1 become 
blank on win2. 
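
The difference between the non-destructive and destructive copies can be sketched on ordinary strings. This is a simplified model, not the library's code: model_overlay() and model_overwrite() are hypothetical names, each "window" is a single line, and both lines are assumed to start at the same position.

```c
#include <assert.h>
#include <string.h>

/* Copy src onto dst; blanks in src leave dst untouched (compare overlay()). */
void model_overlay(const char *src, char *dst)
{
    size_t i, n = strlen(src);

    assert(strlen(dst) >= n);   /* src must fit within dst */
    for (i = 0; i < n; i++)
        if (src[i] != ' ')
            dst[i] = src[i];
}

/* Copy src onto dst; blanks in src blank out dst (compare overwrite()). */
void model_overwrite(const char *src, char *dst)
{
    size_t n = strlen(src);

    assert(strlen(dst) >= n);   /* src must fit within dst */
    memcpy(dst, src, n);
}
```

Starting from "abcdef" and copying "X  Y" onto it, the overlay model gives "XbcYef" while the overwrite model gives "X  Yef".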


printw(fmt, arg1, arg2, ...) 

char *fmt; 


wprintw(win, fmt, arg1, arg2, ...) 

WINDOW *win; 
char *fmt; 

Performs a printf() on the window starting at the current (y, x) co-ordinates. It uses 
addstr() to add the string on the window. It is often advisable to use the field width 
options of printf() to avoid leaving things on the window from earlier calls. This returns 
ERR if it would cause the screen to scroll illegally. 


refresh() t 


wrefresh(win) 

WINDOW *win; 

Synchronize the terminal screen with the desired window. If the window is not a screen, 
only that part covered by it is updated. This returns ERR if it would cause the screen to 
scroll illegally. In this case, it will update whatever it can without causing the scroll. 


standout() t 


wstandout(win) 

WINDOW *win; 


standend() t 


wstandend(win) 

WINDOW *win; 

Start and stop putting characters onto win in standout mode. standout() causes any 
characters added to the window to be put in standout mode on the terminal (if it has that 
capability). standend() stops this. The sequences SO and SE (or US and UE if they are 
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not defined) are used (see Appendix A). 

5.2. Input Functions 


crmode() t 
nocrmode() t 

Set or unset the terminal to/from cbreak mode. 

echo() t 
noecho() t 

Sets the terminal to echo or not echo characters. 


getch() t 


wgetch(win) 

WINDOW *win; 

Gets a character from the terminal and (if necessary) echoes it on the window. This 
returns ERR if it would cause the screen to scroll illegally. Otherwise, the character gotten 
is returned. If noecho has been set, then the window is left unaltered. In order to retain 
control of the terminal, it is necessary to have one of noecho, cbreak, or rawmode set. If 
you do not set one, whatever routine you call to read characters will set cbreak for you, 
and then reset to the original mode when finished. 


getstr(str) t 

char *str; 

wgetstr(win, str) 

WINDOW *win; 
char *str; 

Get a string through the window and put it in the location pointed to by str, which is 
assumed to be large enough to handle it. It sets tty modes if necessary, and then calls 
getch() (or wgetch(win)) to get the characters needed to fill in the string until a newline or 
EOF is encountered. The newline is stripped off the string. This returns ERR if it would 
cause the screen to scroll illegally. 


raw() t 
noraw() t 

Set or unset the terminal to/from raw mode. On version 7 UNIX 9 this also turns off 
newline mapping (see nl()). 


scanw(fmt, arg1, arg2, ...) 

char *fmt; 


wscanw(win, fmt, arg1, arg2, ...) 

WINDOW *win; 
char *fmt; 

Perform a scanf() through the window using fmt. It does this using consecutive getch()’s 
(or wgetch(win)’s). This returns ERR if it would cause the screen to scroll illegally. 

5.3. Miscellaneous Functions 


delwin(win) 

WINDOW *win; 

Deletes the window from existence. All resources are freed for future use by calloc(3). If 
a window has a subwin()-allocated window inside of it, deleting the outer window does not 
affect the subwindow, even though this does invalidate it. Therefore, subwindows 
should be deleted before their outer windows are. 


endwin() 

Finish up window routines before exit. This restores the terminal to the state it was 
before initscr() (or gettmode() and setterm()) was called. It should always be called before 
exiting. It does not exit. This is especially useful for resetting tty stats when trapping 
rubouts via signal(2). 


getyx(win, y, x) t 
WINDOW *win; 
int y, x; 

Puts the current (y, x) co-ordinates of win in the variables y and x . Since it is a macro, 
not a function, you do not pass the address of y and x. 


inch() t 

winch(win) t 

WINDOW *win ; 

Returns the character at the current (y, x) co-ordinates on the given window. This does 
not make any changes to the window. This has no associated “mv” command. 


9 UNIX is a trademark of Bell Laboratories. 


initscr() 

Initialize the screen routines. This must be called before any of the screen routines are 
used. It initializes the terminal-type data and such, and without it, none of the routines 
can operate. If standard input is not a tty, it sets the specifications to the terminal 
whose name is pointed to by Def_term (initially "dumb"). If the boolean My_term is 
true, Def_term is always used. 


leaveok(win, boolf) t 

WINDOW *win; 
bool boolf; 

Sets the boolean flag for leaving the cursor after the last change. If boolf is TRUE, the 
cursor will be left after the last update on the terminal, and the current (y, x) 
co-ordinates for win will be changed accordingly. If it is FALSE, it will be moved to the 
current (y, x) co-ordinates. This flag (initially FALSE) retains its value until changed by 
the user. 


longname(termbuf, name) 

char *termbuf, *name; 

Fills in name with the long (full) name of the terminal described by the termcap entry in 
termbuf. It is generally of little use, but is nice for telling the user in a readable format 
what terminal we think he has. This is available in the global variable ttytype. Termbuf 
is usually set via the termlib routine tgetent(). 


mvwin(win, y, x) 

WINDOW *win; 
int y, x; 

Move the home position of the window win from its current starting coordinates to (y, x). 
If that would put part or all of the window off the edge of the terminal screen, mvwin() 
returns ERR and does not change anything. 

WINDOW * 

newwin(lines, cols, begin_y, begin_x) 

int lines, cols, begin_y, begin_x; 

Create a new window with lines lines and cols columns starting at position 
(begin_y, begin_x). If either lines or cols is 0 (zero), that dimension will be set to 
(LINES - begin_y) or (COLS - begin_x) respectively. Thus, to get a new window of 
dimensions LINES X COLS, use newwin(0, 0, 0, 0). 
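
The rule for a zero dimension can be written out as a one-line computation. The function below is a hypothetical helper for illustration, not part of the library; "total" stands for LINES or COLS and "begin" for begin_y or begin_x.

```c
#include <assert.h>

/*
 * Return the size a window dimension would actually get: the requested
 * size, or everything from "begin" to the edge of the terminal when the
 * request is 0.
 */
int effective_size(int requested, int total, int begin)
{
    assert(begin >= 0 && begin <= total);
    return requested == 0 ? total - begin : requested;
}
```

On a 24 by 80 terminal, for instance, a request of 0 columns starting at column 10 yields 70 columns, and a request of 0 lines starting at line 0 yields all 24 lines.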


nl() t 

nonl() t 

Set or unset the terminal to/from nl mode, i.e., start/stop the system from mapping 
<RETURN> to <LINE-FEED>. If the mapping is not done, refresh() can do more 
optimization, so it is recommended, but not required, to turn it off. 
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scrollok(win, boolf) t 

WINDOW *win; 
bool boolf; 

Set the scroll flag for the given window. If boolf is FALSE, scrolling is not allowed. This 
is its default setting. 


touchwin(win) 

WINDOW *win; 

Make it appear that every location on the window has been changed. This is usually 
only needed for refreshes with overlapping windows. 

WINDOW * 

subwin(win, lines, cols, begin_y, begin_x) 

WINDOW *win; 

int lines, cols, begin_y, begin_x; 

Create a new window with lines lines and cols columns starting at position 
(begin_y, begin_x) in the middle of the window win. This means that any change made 
to either window in the area covered by the subwindow will be made on both windows. 
begin_y, begin_x are specified relative to the overall screen, not the relative (0, 0) of 
win. If either lines or cols is 0 (zero), that dimension will be set to (LINES - begin_y) 
or (COLS - begin_x) respectively. 


unctrl(ch) t 

char ch; 

This is actually a debug function for the library, but it is of general usefulness. It returns 
a string which is a representation of ch. Control characters become their upper-case 
equivalents preceded by a "^". Other letters stay just as they are. To use unctrl(), you 
must have #include <unctrl.h> in your file. 
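
The mapping that unctrl() performs can be sketched as follows. This is an assumption-laden model rather than the library's own macro: model_unctrl() is a hypothetical name, it writes into a caller-supplied buffer, it assumes the ASCII character set, and it handles only the control characters below the blank.

```c
#include <assert.h>
#include <string.h>

/*
 * Write a printable representation of ch into buf (at least 3 bytes)
 * and return buf.  Control characters become '^' followed by the
 * corresponding upper-case character; anything else is copied as is.
 */
const char *model_unctrl(char ch, char buf[3])
{
    unsigned char c = (unsigned char)ch;

    assert(buf != NULL);
    if (c < 0x20) {                     /* ASCII control character */
        buf[0] = '^';
        buf[1] = (char)(c + '@');       /* e.g. 0x01 becomes 'A' */
        buf[2] = '\0';
    } else {
        buf[0] = ch;
        buf[1] = '\0';
    }
    return buf;
}
```

With this model a control-A prints as "^A" and a tab as "^I", while an ordinary letter such as 'x' is returned unchanged.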

5.4. Details 


gettmode() 

Get the tty stats. This is normally called by initscr(). 


mvcur(lasty, lastx, newy, newx) 

int lasty, lastx, newy, newx; 

Moves the terminal’s cursor from (lasty, lastx) to (newy, newx) in an approximation of 
optimal fashion. This routine uses the functions borrowed from ex version 2.6. It is 
possible to use this optimization without the benefit of the screen routines. With the screen 
routines, this should not be called by the user. move() and refresh() should be used to 
move the cursor position, so that the routines know what’s going on. 
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scroll(win) 

WINDOW *win; 

Scroll the window upward one line. This is normally not used by the user. 


savetty() t 
resetty() t 

savetty() saves the current tty characteristic flags. resetty() restores them to what 
savetty() stored. These functions are performed automatically by initscr() and endwin(). 


setterm(name) 

char *name; 

Set the terminal characteristics to be those of the terminal named name. This is 
normally called by initscr(). 


tstp() 

If the new tty(4) driver is in use, this function will save the current tty state and then 
put the process to sleep. When the process gets restarted, it restores the tty state and 
then calls wrefresh(curscr) to redraw the screen. initscr() sets the signal SIGTSTP to 
trap to this routine. 







Appendix A 


1. Capabilities from termcap 

1.1. Disclaimer 

The description of terminals is a difficult business, and we only attempt to summarize the 
capabilities here: for a full description see the paper describing termcap. 

1.2. Overview 

Capabilities from termcap are of three kinds: string valued options, numeric valued 
options, and boolean options. The string valued options are the most complicated, since they 
may include padding information, which we describe now. 

Intelligent terminals often require padding on intelligent operations at high (and 
sometimes even low) speed. This is specified by a number before the string in the capability, and 
has meaning for the capabilities which have a P at the front of their comment. This normally 
is a number of milliseconds to pad the operation. In the current system which has no true 
programmable delays, we do this by sending a sequence of pad characters (normally nulls, but can 
be changed (specified by PC)). In some cases, the pad is better computed as some number of 
milliseconds times the number of affected lines (to the bottom of the screen usually, except 
when terminals have insert modes which will shift several lines). This is specified as, e.g., 12* 
before the capability, to say 12 milliseconds per affected whatever (currently always line). 
Capabilities where this makes sense say P*. 
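
To give a feel for what such a delay costs, the number of pad characters can be estimated from the line speed. The sketch below is not how the library computes padding; pad_chars() is a hypothetical name, and it assumes that each character occupies ten bit times on the line (one start bit, eight data bits, one stop bit).

```c
#include <assert.h>

/*
 * Estimate how many pad characters cover a delay of ms_per_unit
 * milliseconds for each of "units" affected lines at "baud" bits per
 * second.  Use units == 1 for a plain P capability.  The result is
 * rounded up so that the delay is never cut short.
 */
long pad_chars(long ms_per_unit, long units, long baud)
{
    long total_ms;

    assert(ms_per_unit >= 0 && units >= 1 && baud > 0);
    total_ms = ms_per_unit * units;
    /* characters = (total_ms / 1000) * (baud / 10), rounded up */
    return (total_ms * baud + 9999) / 10000;
}
```

Under these assumptions a 12 millisecond pad at 9600 baud takes 12 pad characters, and the same pad applied to three affected lines takes 35.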


1.3. Variables Set By setterm() 

variables set by setterm() 

Type      Name    Pad    Description 

char *    AL      P*     Add new blank Line 
bool      AM             Automatic Margins 
char *    BC             Back Cursor movement 
bool      BS             Backspace works 
char *    BT      P      Back Tab 
bool      CA             Cursor Addressable 
char *    CD      P*     Clear to end of Display 
char *    CE      P      Clear to End of line 
char *    CL      P*     CLear screen 
char *    CM      P      Cursor Motion 
char *    DC      P*     Delete Character 
char *    DL      P*     Delete Line sequence 
char *    DM             Delete Mode (enter) 
char *    DO             DOwn line sequence 
char *    ED             End Delete mode 
bool      EO             can Erase Overstrikes with ' ' 
char *    EI             End Insert mode 
char *    HO             HOme cursor 
bool      HZ             HaZeltine ~ braindamage 
char *    IC      P      Insert Character 
bool      IN             Insert-Null blessing 
char *    IM             enter Insert Mode (IC usually set, too) 
char *    IP      P*     Pad after char Inserted using IM+IE 
char *    LL             quick to Last Line, column 0 
char *    MA             Ctrl character MAp for cmd mode 
bool      MI             can Move in Insert mode 
bool      NC             No Cr: \r sends \r\n then eats \n 
char *    ND             Non-Destructive space 
bool      OS             OverStrike works 
char      PC             Pad Character 
char *    SE             Standout End (may leave space) 
char *    SF      P      Scroll Forwards 
char *    SO             Stand Out begin (may leave space) 
char *    SR      P      Scroll in Reverse 
char *    TA      P      TAb (not ^I or with padding) 
char *    TE             Terminal address enable Ending sequence 
char *    TI             Terminal address enable Initialization 
char *    UC             Underline a single Character 
char *    UE             Underline Ending sequence 
bool      UL             Underlining works even though !OS 
char *    UP             UPline 
char *    US             Underline Starting sequence 10 
char *    VB             Visible Bell 
char *    VE             Visual End sequence 
char *    VS             Visual Start sequence 
bool      XN             a Newline gets eaten after wrap 

Names starting with X are reserved for severely nauseous glitches 

1.4. Variables Set By gettmode() 



variables set by gettmode() 

type      name         description 

bool      NONL         Term can’t hack linefeeds doing a CR 
bool      GT           Gtty indicates Tabs 
bool      UPPERCASE    Terminal generates only uppercase letters 


10 US and UE, if they do not exist in the termcap entry, are copied from SO and SE in setterm(). 

1. The WINDOW structure 

The WINDOW structure is defined as follows: 


# define    WINDOW    struct _win_st 

struct _win_st { 
        short   _cury, _curx; 
        short   _maxy, _maxx; 
        short   _begy, _begx; 
        short   _flags; 
        bool    _clear; 
        bool    _leave; 
        bool    _scroll; 
        char    **_y; 
        short   *_firstch; 
        short   *_lastch; 
}; 

# define    _SUBWIN       01 
# define    _ENDLINE      02 
# define    _FULLWIN      04 
# define    _SCROLLWIN    010 
# define    _STANDOUT     0200 


_cury and _curx are the current (y, x) co-ordinates for the window. New characters 
added to the screen are added at this point. _maxy and _maxx are the maximum values 
allowed for (_cury, _curx). _begy and _begx are the starting (y, x) co-ordinates on the 
terminal for the window, i.e., the window’s home. _cury, _curx, _maxy, and _maxx are measured 
relative to (_begy, _begx), not the terminal’s home. 

_clear tells if a clear-screen sequence is to be generated on the next refresh() call. This 
is only meaningful for screens. The initial clear-screen for the first refresh() call is generated 
by initially setting _clear to be TRUE for curscr, which always generates a clear-screen if set, 
irrelevant of the dimensions of the window involved. _leave is TRUE if the current (y, x) 
co-ordinates and the cursor are to be left after the last character changed on the terminal, or not 
moved if there is no change. _scroll is TRUE if scrolling is allowed. 

_y is a pointer to an array of lines which describe the terminal. Thus: 

_y[i] 

is a pointer to the ith line, and 

_y[i][j] 

is the jth character on the ith line. 

_flags can have one or more values or’d into it. _SUBWIN means that the window is 
a subwindow, which indicates to delwin() that the space for the lines is not to be freed. 
_ENDLINE says that the end of the line for this window is also the end of a screen. 
_FULLWIN says that this window is a screen. _SCROLLWIN indicates that the last 
character of this screen is at the lower right-hand corner of the terminal; i.e., if a character was 
put there, the terminal would scroll. _STANDOUT says that all characters added to the 
screen are in standout mode. 
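Because the flag values are distinct bits, several can be or’d together and tested
individually. The following is a minimal sketch of such a test, not part of the package;
the helper name has_flags() is hypothetical, and the flag values are repeated from the
listing above so that the fragment stands alone.

```c
/* Flag values as given in the WINDOW listing. */
#define _SUBWIN     01
#define _ENDLINE    02
#define _FULLWIN    04
#define _SCROLLWIN  010
#define _STANDOUT   0200

/*
 * Hypothetical helper (not a curses routine): returns nonzero
 * if every bit in mask is set in flags.
 */
int has_flags(short flags, short mask)
{
    return (flags & mask) == mask;
}
```

A full-screen window whose last character sits in the lower right-hand corner would carry
_FULLWIN | _SCROLLWIN, which is 014 in octal; has_flags() on that value reports both bits
as set and _SUBWIN as clear.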


11 All variables not normally accessed directly by the user are named with an initial “_” to
avoid conflicts with the user’s variables.
1. Examples

Here we present a few examples of how to use the package. They attempt to be
representative, though not comprehensive.

2. Screen Updating 

The following examples are intended to demonstrate the basic structure of a program using
the screen updating sections of the package. Several of the programs require calculational
sections which are irrelevant to the example, and are therefore usually not included. It is
hoped that the data structure definitions give enough of an idea to allow understanding of what
the relevant portions do. The rest is left as an exercise to the reader, and will not be on the
final.

2.1. Twinkle 

This is a moderately simple program which prints pretty patterns on the screen that
might even hold your interest for 30 seconds or more. It switches between patterns of
asterisks, putting them on one by one in random order, and then taking them off in the same
fashion. It is more efficient to write this using only the motion optimization, as is
demonstrated below.

# include       <curses.h>
# include       <signal.h>

/*
 * the idea for this program was a product of the imagination of
 * Kurt Schoens. Not responsible for minds lost or stolen.
 */

# define        NCOLS           80
# define        NLINES          24
# define        MAXPATTERNS     4

struct locs {
        char    y, x;
};

typedef struct locs     LOCS;

LOCS    Layout[NCOLS * NLINES];         /* current board layout */

int     Pattern,                        /* current pattern number */
        Numstars;                       /* number of stars in pattern */

main() {

        char    *getenv();
        int     die();

        srand(getpid());                /* initialize random sequence */

        initscr();
        signal(SIGINT, die);
        noecho();
        nonl();
        leaveok(stdscr, TRUE);
        scrollok(stdscr, FALSE);

        for (;;) {
                makeboard();            /* make the board setup */
                puton('*');             /* put on '*'s */
                puton(' ');             /* cover up with ' 's */
        }
}

/*
 * On program exit, move the cursor to the lower left corner by
 * direct addressing, since current location is not guaranteed.
 * We lie and say we used to be at the upper right corner to guarantee
 * absolute addressing.
 */
die() {

        signal(SIGINT, SIG_IGN);
        mvcur(0, COLS-1, LINES-1, 0);
        endwin();
        exit(0);
}

/*
 * Make the current board setup. It picks a random pattern and
 * calls ison() to determine if the character is on that pattern
 * or not.
 */
makeboard() {

        reg int         y, x;
        reg LOCS        *lp;

        Pattern = rand() % MAXPATTERNS;
        lp = Layout;
        for (y = 0; y < NLINES; y++)
                for (x = 0; x < NCOLS; x++)
                        if (ison(y, x)) {
                                lp->y = y;
                                lp++->x = x;
                        }
        Numstars = lp - Layout;
}

/*
 * Return TRUE if (y, x) is on the current pattern.
 */
ison(y, x)
reg int y, x; {

        switch (Pattern) {
          case 0:       /* alternating lines */
                return !(y & 01);
          case 1:       /* box */
                if (x >= LINES && y >= NCOLS)
                        return FALSE;
                if (y < 3 || y >= NLINES - 3)
                        return TRUE;
                return (x < 3 || x >= NCOLS - 3);
          case 2:       /* holy pattern! */
                return ((x + y) & 01);
          case 3:       /* bar across center */
                return (y >= 9 && y <= 15);
        }
        /* NOTREACHED */
}

puton(ch)
reg char        ch; {

        reg LOCS        *lp;
        reg int         r;
        reg LOCS        *end;
        LOCS            temp;

        end = &Layout[Numstars];
        for (lp = Layout; lp < end; lp++) {
                r = rand() % Numstars;
                temp = *lp;
                *lp = Layout[r];
                Layout[r] = temp;
        }

        for (lp = Layout; lp < end; lp++) {
                mvaddch(lp->y, lp->x, ch);
                refresh();
        }
}
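The random ordering in puton() comes from swapping each entry of Layout with a randomly
chosen entry. The fragment below is a minimal sketch of that step in isolation, written in
ANSI style so it can be compiled without curses. The function name shuffle_locs() and its
parameters are hypothetical; only the LOCS layout and the swap itself are taken from the
listing.

```c
#include <stdlib.h>

struct locs {
    char y, x;
};
typedef struct locs LOCS;

/*
 * Hypothetical stand-alone version of the first loop of puton():
 * swap every entry with one picked at random from the whole array.
 * The result is always a permutation of the input; no entry is
 * lost or duplicated.
 */
void shuffle_locs(LOCS *layout, int n)
{
    int i, r;
    LOCS temp;

    for (i = 0; i < n; i++) {
        r = rand() % n;         /* n > 0 whenever the loop body runs */
        temp = layout[i];
        layout[i] = layout[r];
        layout[r] = temp;
    }
}
```

Since the second loop of puton() simply walks the array in order, the visible effect of
the shuffle is that the characters appear on the screen in random order.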


2.2. Life 

This program plays the famous computer pattern game of life (Scientific American, May, 
1974). The calculational routines create a linked list of structures defining where each piece is. 
Nothing here claims to be optimal, merely demonstrative. This program, however, is a very 
good place to use the screen updating routines, as it allows them to worry about what the last 
position looked like, so you don’t have to. It also demonstrates some of the input routines. 

# include       <curses.h>
# include       <signal.h>

/*
 *      Run a life game. This is a demonstration program for
 * the Screen Updating section of the -lcurses cursor package.
 */

struct lst_st {                         /* linked list element */
        int             y, x;           /* (y, x) position of piece */
        struct lst_st   *next, *last;   /* doubly linked */
};

typedef struct lst_st   LIST;

LIST    *Head;                          /* head of linked list */

main(ac, av)
int     ac;
char    *av[]; {

        int     die();

        evalargs(ac, av);               /* evaluate arguments */

        initscr();                      /* initialize screen package */
        signal(SIGINT, die);            /* set to restore tty stats */
        crmode();                       /* set for char-by-char */
        noecho();                       /*      input */
        nonl();                         /* for optimization */

        getstart();                     /* get starting position */
        for (;;) {
                prboard();              /* print out current board */
                update();               /* update board position */
        }
}

/*
 * This is the routine which is called when rubout is hit.
 * It resets the tty stats to their original values. This
 * is the normal way of leaving the program.
 */
die() {

        signal(SIGINT, SIG_IGN);        /* ignore rubouts */
        mvcur(0, COLS-1, LINES-1, 0);   /* go to bottom of screen */
        endwin();                       /* set terminal to initial state */
        exit(0);
}

/*
 * Get the starting position from the user. The keys u, i, o, j, l,
 * m, ,, and . are used for moving their relative directions from the
 * k key. Thus, u moves diagonally up to the left, , moves directly down,
 * etc. x places a piece at the current position, " " takes it away.
 * The input can also be from a file. The list is built after the
 * board setup is ready.
 */
getstart() {

        reg char        c;
        reg int         x, y;

        box(stdscr, '|', '_');          /* box in the screen */
        move(1, 1);                     /* move to upper left corner */

        for (;;) {
                refresh();              /* print current position */
                if ((c = getch()) == 'q')
                        break;
                switch (c) {
                  case 'u':
                  case 'i':
                  case 'o':
                  case 'j':
                  case 'l':
                  case 'm':
                  case ',':
                  case '.':
                        adjustyx(c);
                        break;
                  case 'f':
                        mvaddstr(0, 0, "File name: ");
                        getstr(buf);
                        readfile(buf);
                        break;
                  case 'x':
                        addch('X');
                        break;
                  case ' ':
                        addch(' ');
                        break;
                }
        }

        if (Head != NULL)               /* start new list */
                dellist(Head);
        Head = malloc(sizeof (LIST));

        /*
         * loop through the screen looking for 'X's, and add a list
         * element for each one
         */
        for (y = 1; y < LINES - 1; y++)
                for (x = 1; x < COLS - 1; x++) {
                        move(y, x);
                        if (inch() == 'X')
                                addlist(y, x);
                }
}

/*
 * Print out the current board position from the linked list
 */
prboard() {

        reg LIST        *hp;

        erase();                        /* clear out last position */
        box(stdscr, '|', '_');          /* box in the screen */

        /*
         * go through the list adding each piece to the newly
         * blank board
         */
        for (hp = Head; hp; hp = hp->next)
                mvaddch(hp->y, hp->x, 'X');

        refresh();
}
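The listing calls addlist() and dellist() without showing them. As a rough idea of what
such routines involve, here is a minimal sketch of a doubly linked list of pieces. It is
an assumption, not the program's actual code: the names push_piece(), count_pieces() and
free_pieces() are hypothetical, and unlike the listing this sketch keeps no separately
allocated head element.

```c
#include <stdlib.h>

struct lst_st {
    int y, x;                       /* (y, x) position of piece */
    struct lst_st *next, *last;     /* doubly linked */
};
typedef struct lst_st LIST;

/*
 * Hypothetical: add a piece at the front of the list and return
 * the new head. If no memory is available the list is unchanged.
 */
LIST *push_piece(LIST *head, int y, int x)
{
    LIST *lp = malloc(sizeof (LIST));

    if (lp == NULL)
        return head;
    lp->y = y;
    lp->x = x;
    lp->last = NULL;
    lp->next = head;
    if (head != NULL)
        head->last = lp;
    return lp;
}

/* Hypothetical: count the pieces by following the next pointers. */
int count_pieces(const LIST *head)
{
    int n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/* Hypothetical: release every element of the list. */
void free_pieces(LIST *head)
{
    LIST *next;

    while (head != NULL) {
        next = head->next;
        free(head);
        head = next;
    }
}
```

A loop like the one in prboard() can walk a list built this way unchanged, since it only
follows the next pointers until it reaches NULL.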


3. Motion optimization 

The following example shows how motion optimization is written on its own. Programs
which flit from one place to another without regard for what is already there usually do not
need the overhead of both space and time associated with screen updating. They should
instead use motion optimization.

3.1. Twinkle 

The twinkle program is a good candidate for simple motion optimization. Here is how 
it could be written (only the routines that have been changed are shown): 

main() {

        reg char        *sp;
        char            *getenv();
        int             _putchar(), die();

        srand(getpid());                /* initialize random sequence */

        if (isatty(0)) {
                gettmode();
                if (sp = getenv("TERM"))
                        setterm(sp);
                signal(SIGINT, die);
        }
        else {
                printf("Need a terminal on %d\n", _tty_ch);
                exit(1);
        }
        _puts(TI);
        _puts(VS);

        noecho();
        nonl();
        tputs(CL, NLINES, _putchar);
        for (;;) {
                makeboard();            /* make the board setup */
                puton('*');             /* put on '*'s */
                puton(' ');             /* cover up with ' 's */
        }
}

/*
 * _putchar defined for tputs() (and _puts())
 */
_putchar(c)
reg char        c; {

        putchar(c);
}

puton(ch)
char    ch; {

        static int      lasty, lastx;
        reg LOCS        *lp;
        reg int         r;
        reg LOCS        *end;
        LOCS            temp;

        end = &Layout[Numstars];
        for (lp = Layout; lp < end; lp++) {
                r = rand() % Numstars;
                temp = *lp;
                *lp = Layout[r];
                Layout[r] = temp;
        }

        for (lp = Layout; lp < end; lp++)
                        /* prevent scrolling */
                if (!AM || (lp->y < NLINES - 1 || lp->x < NCOLS - 1)) {
                        mvcur(lasty, lastx, lp->y, lp->x);
                        putchar(ch);
                        lasty = lp->y;
                        if ((lastx = lp->x + 1) >= NCOLS)
                                if (AM) {
                                        lastx = 0;
                                        lasty++;
                                }
                                else
                                        lastx = NCOLS - 1;
                }
}
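Without the screen updating routines, the program itself must remember where the cursor
ends up after each character is written, so that the next mvcur() call starts from the
right place. The following is a minimal sketch of that bookkeeping pulled out of puton();
the function name advance_cursor() and its parameters are hypothetical, with am standing
in for the AM (automatic margins) capability.

```c
/*
 * Hypothetical helper: update (*y, *x) to the cursor position after
 * one character has been written at (*y, *x) on a line ncols wide.
 * With automatic margins the cursor wraps to the start of the next
 * line; without them it stays in the last column.
 */
void advance_cursor(int *y, int *x, int ncols, int am)
{
    if (++*x >= ncols) {
        if (am) {
            *x = 0;
            ++*y;
        }
        else
            *x = ncols - 1;
    }
}
```

This is also why puton() refuses to write in the lower right-hand corner when AM is set:
the wrap would move the cursor past the last line and the terminal would scroll.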
ABSTRACT

This document describes the structure and installation procedure for the 
line printer spooling system developed for the 4.2BSD version of the UNIX* 
operating system. 


1. Overview 

The line printer system supports: 

• multiple printers, 

• multiple spooling queues, 

• both local and remote printers, and 

• printers attached via serial lines which require line initialization such as the baud rate. 

Raster output devices such as a Varian or Versatec, and laser printers such as an Imagen, are 
also supported by the line printer system. 

The line printer system consists mainly of the following files and commands: 


/etc/printcap       printer configuration and capability data base
/usr/lib/lpd        line printer daemon, does all the real work
/usr/ucb/lpr        program to enter a job in a printer queue
/usr/ucb/lpq        spooling queue examination program
/usr/ucb/lprm       program to delete jobs from a queue
/etc/lpc            program to administer printers and spooling queues
/dev/printer        socket on which lpd listens


The file /etc/printcap is a master data base describing line printers directly attached to a
machine and, also, printers accessible across a network. The manual page entry printcap(5)
provides the ultimate definition of the format of this data base, as well as indicating default
values for important items such as the directory in which spooling is performed. This
document highlights the important information which may be placed in printcap.


* UNIX is a trademark of Bell Laboratories. 
2. Commands


2.1. lpd - line printer daemon

The program lpd(8), usually invoked at boot time from the /etc/rc file, acts as a master
server for coordinating and controlling the spooling queues configured in the printcap file.
When lpd is started it makes a single pass through the printcap database restarting any
printers which have jobs. In normal operation lpd listens for service requests on multiple
sockets, one in the UNIX domain (named “/dev/printer”) for local requests, and one in the
Internet domain (under the “printer” service specification) for requests for printer access from
off machine; see socket(2) and services(5) for more information on sockets and service
specifications, respectively. Lpd spawns a copy of itself to process the request; the master
daemon continues to listen for new requests.

Clients communicate with lpd using a simple transaction oriented protocol.
Authentication of remote clients is done based on the “privilege port” scheme employed by
rshd(8C) and rcmd(3X). The following table shows the requests understood by lpd. In each
request the first byte indicates the “meaning” of the request, followed by the name of the
printer to which it should be applied. Additional qualifiers may follow, depending on the request.


Request                                       Interpretation
^Aprinter\n                                   check the queue for jobs and print any found
^Bprinter\n                                   receive and queue a job from another machine
^Cprinter [users ...] [jobs ...]\n            return short list of current queue state
^Dprinter [users ...] [jobs ...]\n            return long list of current queue state
^Eprinter person [users ...] [jobs ...]\n     remove jobs from a queue
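To make the request format concrete, the fragment below sketches how a client might
build one request line before writing it to the socket. It is an illustration under stated
assumptions, not code from lpr or lpq: the enum names and the function
lpd_format_request() are hypothetical, and the only facts taken from the table are that
the first byte is control-A through control-E (values 1 through 5), followed by the printer
name, optional qualifiers, and a newline.

```c
#include <stdio.h>

/* First byte of each request; the names are hypothetical. */
enum lpd_request {
    LPD_PRINT_QUEUE = 1,    /* ^A: check the queue and print any jobs */
    LPD_RECEIVE_JOB = 2,    /* ^B: receive a job from another machine */
    LPD_SHORT_STATE = 3,    /* ^C: short list of queue state */
    LPD_LONG_STATE  = 4,    /* ^D: long list of queue state */
    LPD_REMOVE_JOBS = 5     /* ^E: remove jobs from a queue */
};

/*
 * Hypothetical helper: format one request into buf. args holds any
 * qualifiers (users, jobs) already separated by spaces, or is NULL.
 * Returns the number of bytes in the request, or -1 if it does not
 * fit in buf.
 */
int lpd_format_request(char *buf, size_t size, int code,
                       const char *printer, const char *args)
{
    int n;

    if (args != NULL && args[0] != '\0')
        n = snprintf(buf, size, "%c%s %s\n", code, printer, args);
    else
        n = snprintf(buf, size, "%c%s\n", code, printer);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

For example, asking for the short queue listing of printer “lp” restricted to user
“root” produces the byte 3 followed by the text “lp root” and a newline.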


The lpr(1) command is used by users to enter a print job in a local queue and to notify
the local lpd that there are new jobs in the spooling area. Lpd either schedules the job to be
printed locally, or in the case of remote printing, attempts to forward the job to the
appropriate machine. If the printer cannot be opened or the destination machine is unreachable,
the job will remain queued until it is possible to complete the work.

2.2. lpq - show line printer queue 

The lpq(1) program works recursively backwards displaying the queue of the machine
with the printer and then the queue(s) of the machine(s) that lead to it. Lpq has two forms of
output: in the default, short, format it gives a single line of output per queued job; in the long
format it shows the list of files, and their sizes, which comprise a job.


2.3. lprm - remove jobs from a queue 

The lprm(1) command deletes jobs from a spooling queue. If necessary, lprm will first
kill off a running daemon which is servicing the queue, restarting it after the required files are
removed. When removing jobs destined for a remote printer, lprm acts similarly to lpq except
it first checks locally for jobs to remove and then tries to remove files in queues off-machine.

2.4. lpc - line printer control program 

The lpc(8) program is used by the system administrator to control the operation of the
line printer system. For each line printer configured in /etc/printcap, lpc may be used to:

• disable or enable a printer,

• disable or enable a printer’s spooling queue,

• rearrange the order of jobs in a spooling queue,

• find the status of printers, and their associated spooling queues and printer daemons.
3. Access control

The printer system maintains protected spooling areas so that users cannot circumvent
printer accounting or remove files other than their own. The strategy used to maintain
protected spooling areas is as follows:

• The spooling area is writable only by a daemon user and spooling group.

• The lpr program runs setuid root and setgid spooling. The root access is used to read any
file required, verifying accessibility with an access(2) call. The group ID is used in setting
up proper ownership of files in the spooling area for lprm.

• Control files in a spooling area are made with daemon ownership and group ownership
spooling. Their mode is 0660. This ensures control files are not modified by a user and
that no user can remove files except through lprm.

• The spooling programs, lpd, lpq, and lprm run setuid root and setgid spooling to access
spool files and printers.

• The printer server, lpd, uses the same verification procedures as rshd(8C) in authenticating
remote clients. The host on which a client resides must be present in the file
/etc/hosts.equiv, used to create clusters of machines under a single administration.

In practice, none of lpd, lpq, or lprm would have to run as user root if remote spooling
were not supported. In previous incarnations of the printer system lpd ran setuid daemon,
setgid spooling, and lpq and lprm ran setgid spooling.

4. Setting up 

The 4.2BSD release comes with the necessary programs installed and with the default 
line printer queue created. If the system must be modified, the makefile in the directory 
/usr/src/usr.lib/lpr should be used in recompiling and reinstalling the necessary programs. 

The real work in setting up is to create the printcap file and any printer filters for 
printers not supported in the distribution system. 

4.1. Creating a printcap file 

The printcap database contains one or more entries per printer. A printer should have a
separate spooling directory; otherwise, jobs will be printed on different printers depending on
which printer daemon starts first. This section describes how to create entries for printers
which do not conform to the default printer description (an LP-11 style interface to a
standard band printer).

4.1.1. Printers on serial lines 

When a printer is connected via a serial communication line it must have the proper
baud rate and terminal modes set. The following example is for a DecWriter III printer
connected locally via a 1200 baud serial line.

    lp|LA-180 DecWriter III:\
            :lp=/dev/lp:br#1200:fs#06320:\
            :tr=\f:of=/usr/lib/lpf:lf=/usr/adm/lpd-errs:

The lp entry specifies the file name to open for output. In this case it could be left out since
“/dev/lp” is the default. The br entry sets the baud rate for the tty line and the fs entry sets
CRMOD, no parity, and XTABS (see tty(4)). The tr entry indicates a form-feed should be
printed when the queue empties so the paper can be torn off without turning the printer
off-line and pressing form feed. The of entry specifies the filter program lpf should be used for
printing the files; more will be said about filters later. The last entry causes errors to be
written to the file “/usr/adm/lpd-errs” instead of the console.
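The numeric entries above use a “#” between the name and the value, and a leading zero
marks the value as octal, so fs#06320 is a set of mode bits rather than a count. The
following sketch shows one way to pull such a value out of an entry and to check how
06320 decomposes. It is an illustration only: cap_number() is a hypothetical helper (the
real lookup is done by the printcap routines), the entry is assumed to be already joined
onto one line, and the FS_ constants are assumed to match the sgtty mode bits of 4.2BSD
as described in tty(4).

```c
#include <stdlib.h>
#include <string.h>

/* Assumed values of the relevant tty mode bits (see tty(4)). */
#define FS_CRMOD    020     /* map CR to LF on input, LF to CR-LF on output */
#define FS_ODDP     0100    /* odd parity accepted */
#define FS_EVENP    0200    /* even parity accepted */
#define FS_XTABS    06000   /* expand tabs to spaces on output */

/*
 * Hypothetical helper: find the numeric capability name in a single
 * colon-separated entry and return its value, or -1 if it is absent.
 * Base 0 lets strtol() treat a leading 0 as octal.
 */
long cap_number(const char *entry, const char *name)
{
    size_t len = strlen(name);
    const char *p = entry;

    while ((p = strchr(p, ':')) != NULL) {
        p++;                            /* step past the colon */
        if (strncmp(p, name, len) == 0 && p[len] == '#')
            return strtol(p + len + 1, NULL, 0);
    }
    return -1;
}
```

Under these assumptions 06320 is FS_XTABS | FS_EVENP | FS_ODDP | FS_CRMOD. Accepting
both odd and even parity is what the text calls “no parity”.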
4.1.2. Remote printers

Printers which reside on remote hosts should have an empty lp entry. For example, the
following printcap entry would send output to the printer named “lp” on the machine
“ucbvax”.

    lp|default line printer:\
            :lp=:rm=ucbvax:rp=lp:sd=/usr/spool/vaxlpd:

The rm entry is the name of the remote machine to connect to; this name must appear in the
/etc/hosts database, see hosts(5). The rp capability indicates the name of the printer on the
remote machine is “lp”; in this case it could be left out since this is the default value. The sd
entry specifies “/usr/spool/vaxlpd” as the spooling directory instead of the default value of
“/usr/spool/lpd”.

4.2. Output filters 

Filters are used to handle device dependencies and to perform accounting functions. The
output filter of is used to filter text data to the printer device when accounting is not used or
when all text data must be passed through a filter. It is not intended to perform accounting
since it is started only once, all text files are filtered through it, and no provision is made for
passing owners’ login name, identifying the beginning and ending of jobs, etc. The other filters
(if specified) are started for each file printed and perform accounting if there is an af entry. If
entries for both of and one of the other filters are specified, the output filter is used only to
print the banner page; it is then stopped to allow other filters access to the printer. An
example of a printer which requires output filters is the Benson-Varian.

    va|varian|Benson-Varian:\
            :lp=/dev/va0:sd=/usr/spool/vad:of=/usr/lib/vpf:\
            :tf=/usr/lib/rvcat:mx#2000:pl#58:tr=\f:

The tf entry specifies “/usr/lib/rvcat” as the filter to be used in printing troff(1) output. This
filter is needed to set the device into print mode for text, and plot mode for printing troff files
and raster images (see va(4V)). Note that the page length is set to 58 lines by the pl entry for
8.5” by 11” fan-fold paper. To enable accounting, the varian entry would be augmented with
an af filter as shown below.

    va|varian|Benson-Varian:\
            :lp=/dev/va0:sd=/usr/spool/vad:of=/usr/lib/vpf:\
            :if=/usr/lib/vpf:tf=/usr/lib/rvcat:af=/usr/adm/vaacct:\
            :mx#2000:pl#58:tr=\f:

5. Output filter specifications 

The filters supplied with 4.2BSD handle printing and accounting for most common line
printers, the Benson-Varian, and the wide (36”) and narrow (11”) Versatec printer/plotters.
For other devices or accounting methods, it may be necessary to create a new filter.

Filters are spawned by lpd with their standard input the data to be printed, and standard
output the printer. The standard error is attached to the lf file for logging errors. A filter
must return a 0 exit code if there were no errors, 1 if the job should be reprinted, and 2 if the
job should be thrown away. When lprm sends a kill signal to the lpd process controlling
printing, it sends a SIGINT signal to all filters and descendants of filters. This signal can be
trapped by filters which need to perform cleanup operations such as deleting temporary files.
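The exit code convention can be summarized as a small mapping. The sketch below is only
an illustration of the three codes named above; the enum and the function
filter_action_for() are hypothetical names, and the handling of any other exit code is
left as “unspecified” because the text does not say what lpd does with it.

```c
/* What should happen to a job after its filter exits (names hypothetical). */
enum filter_action {
    JOB_DONE,           /* exit code 0: no errors */
    JOB_REPRINT,        /* exit code 1: the job should be reprinted */
    JOB_DISCARD,        /* exit code 2: the job should be thrown away */
    JOB_UNSPECIFIED     /* any other code: not covered by the text */
};

/* Hypothetical helper: translate a filter's exit code into an action. */
enum filter_action filter_action_for(int exit_code)
{
    switch (exit_code) {
    case 0:
        return JOB_DONE;
    case 1:
        return JOB_REPRINT;
    case 2:
        return JOB_DISCARD;
    default:
        return JOB_UNSPECIFIED;
    }
}
```

A filter that creates temporary files would normally also install a SIGINT handler with
signal(2) that removes them before exiting, since that is the signal sent when a job is
removed while printing.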

Arguments passed to a filter depend on its type. The of filter is called with the following
arguments.

    filter -wwidth -llength

The width and length values come from the pw and pl entries in the printcap database. The
if filter is passed the following parameters.

    filter [-c] -wwidth -llength -iindent -n login -h host accounting_file

The -c flag is optional, and only supplied when control characters are to be passed
uninterpreted to the printer (when the -l option of lpr is used to print the file). The -w and -l
parameters are the same as for the of filter. The -n and -h parameters specify the login
name and host name of the job owner. The last argument is the name of the accounting file
from printcap.

All other filters are called with the following arguments:

    filter -xwidth -ylength -n login -h host accounting_file

The -x and -y options specify the horizontal and vertical page size in pixels (from the px and
py entries in the printcap file). The rest of the arguments are the same as for the if filter.
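A new if filter has to pick these parameters apart itself. The fragment below is a minimal
sketch of doing so, assuming the argument layout shown above: numeric values are attached
directly to -w, -l and -i, while the login and host names follow -n and -h as separate
arguments. The structure if_args and the function parse_if_args() are hypothetical and are
not taken from any supplied filter.

```c
#include <stdlib.h>

/* Parameters handed to an if filter (structure is hypothetical). */
struct if_args {
    int literal;                /* -c given: pass control characters */
    int width, length, indent;  /* from -w, -l and -i */
    const char *login;          /* from -n */
    const char *host;           /* from -h */
    const char *acct_file;      /* last argument */
};

/*
 * Hypothetical helper: fill in *out from the argument vector.
 * Returns 0 on success, -1 on an unknown flag or a missing name.
 */
int parse_if_args(int argc, char **argv, struct if_args *out)
{
    int i;

    out->literal = 0;
    out->width = out->length = out->indent = 0;
    out->login = out->host = out->acct_file = NULL;

    for (i = 1; i < argc; i++) {
        const char *a = argv[i];

        if (a[0] != '-') {              /* accounting file name */
            out->acct_file = a;
            continue;
        }
        switch (a[1]) {
        case 'c': out->literal = 1; break;
        case 'w': out->width = atoi(a + 2); break;
        case 'l': out->length = atoi(a + 2); break;
        case 'i': out->indent = atoi(a + 2); break;
        case 'n':
            if (++i >= argc)
                return -1;
            out->login = argv[i];
            break;
        case 'h':
            if (++i >= argc)
                return -1;
            out->host = argv[i];
            break;
        default:
            return -1;
        }
    }
    return 0;
}
```

A filter for the other types could follow the same pattern with -x and -y in place of
-w, -l and -i.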

6. Line printer Administration 

The lpc program provides local control over line printer activity. The major commands
and their intended use will be described. The command format and remaining commands are
described in lpc(8).

abort and start

Abort terminates an active spooling daemon on the local host immediately and then
disables printing (preventing new daemons from being started by lpr). This is normally
used to forcibly restart a hung line printer daemon (i.e., lpq reports that there is a
daemon present but nothing is happening). It does not remove any jobs from the queue (use
the lprm command instead). Start enables printing and requests lpd to start printing
jobs.

enable and disable

Enable and disable allow spooling in the local queue to be turned on/off. This will
allow/prevent lpr from putting new jobs in the spool queue. It is frequently convenient to
turn spooling off while testing new line printer filters since the root user can still use lpr
to put jobs in the queue but no one else can. The other main use is to prevent users from
putting jobs in the queue when the printer is expected to be unavailable for a long time.

restart

Restart allows ordinary users to restart printer daemons when lpq reports that there is no
daemon present.

stop

Stop is used to halt a spooling daemon after the current job completes; this also disables
printing. This is a clean way to shut down a printer in order to perform maintenance, etc.
Note that users can still enter jobs in a spool queue while a printer is stopped.

topq

Topq places jobs at the top of a printer queue. This can be used to reorder high priority
jobs since lpr only provides first-come-first-serve ordering of jobs.

7. Troubleshooting 

There are a number of messages which may be generated by the line printer system.
This section categorizes the most common and explains the cause for their generation. Where
the message indicates a failure, directions are given to remedy the problem.

In the examples below, the name printer is the name of the printer. This would be one of 
the names from the printcap database. 
7.1. LPR


lpr: printer: unknown printer

The printer was not found in the printcap database. Usually this is a typing mistake;
however, it may indicate a missing or incorrect entry in the /etc/printcap file.

lpr: printer: jobs queued, but cannot start daemon.

The connection to lpd on the local machine failed. This usually means the printer server
started at boot time has died or is hung. Check the local socket /dev/printer to be sure it
still exists (if it does not exist, there is no lpd process running). Use

    % ps ax | fgrep lpd

to get a list of process identifiers of running lpd’s. The lpd to kill is the one which is not
listed in any of the “lock” files (the lock file is contained in the spool directory of each
printer). Kill the master daemon using the following command.

    % kill pid

Then remove /dev/printer and restart the daemon (and printer) with the following
commands.

    % rm /dev/printer
    % /usr/lib/lpd

Another possibility is that the lpr program is not setuid root, setgid spooling. This can be
checked with

    % ls -lg /usr/ucb/lpr

lpr: printer: printer queue is disabled

This means the queue was turned off with

    % lpc disable printer

to prevent lpr from putting files in the queue. This is normally done by the system
manager when a printer is going to be down for a long time. The printer can be turned
back on by a super-user with lpc.

7.2. LPQ 

waiting for printer to become ready (offline ?)

The printer device could not be opened by the daemon. This can happen for a number of
reasons, the most common being that the printer is turned off-line. This message can
also be generated if the printer is out of paper, the paper is jammed, etc. The actual
reason is dependent on the meaning of error codes returned by the system device driver. Not
all printers supply sufficient information to distinguish when a printer is off-line or having
trouble (e.g. a printer connected through a serial line). Another possible cause of this
message is that some other process, such as an output filter, has an exclusive open on the
device. Your only recourse here is to kill off the offending program(s) and restart the printer
with lpc.

printer is ready and printing

The lpq program checks to see if a daemon process exists for printer and prints the file
status. If the daemon is hung, a super user can use lpc to abort the current daemon and
start a new one.

waiting for host to come up

This indicates there is a daemon trying to connect to the remote machine named host in
order to send the files in the local queue. If the remote machine is up, lpd on the remote
machine is probably dead or hung and should be restarted as mentioned for lpr.

sending to host

The files should be in the process of being transferred to the remote host. If not, the
local daemon should be aborted and started with lpc.

Warning: printer is down

The printer has been marked as being unavailable with lpc.

Warning: no daemon present

The lpd process overseeing the spooling queue, as indicated in the “lock” file in that
directory, does not exist. This normally occurs only when the daemon has unexpectedly
died. The error log file for the printer should be checked for a diagnostic from the
deceased process. To restart an lpd, use

    % lpc restart printer


7.3. LPRM 

lprm: printer: cannot restart printer daemon

This case is the same as when lpr prints that the daemon cannot be started.

7.4. LPD

The lpd program can write many different messages to the error log file (the file specified
in the lf entry in printcap). Most of these messages are about files which can not be opened
and usually indicate the printcap file or the protection modes of the files are not correct. Files
may also be inaccessible if people manually manipulate the line printer system (i.e. they bypass
the lpr program).

In addition to messages generated by lpd, any of the filters that lpd spawns may also log
messages to this file.

7.5. LPC

couldn’t start printer

This case is the same as when lpr reports that the daemon cannot be started.

cannot examine spool directory

Error messages beginning with “cannot ...” are usually due to incorrect ownership and/or
protection mode of the lock file, spooling directory or the lpc program.
PART 5: OTHER LANGUAGES


The articles in this part are reference materials appropriate for people familiar with 
programming in the languages described. Each article defines the implementation of 
a language on the ULTRIX-32m system. With the exception of the article on 
Pascal, these articles are not tutorial, and they are not for beginners. 

Berkeley Pascal 

The “Berkeley Pascal User’s Manual” tells what you need to know to write and 
execute Pascal programs on the ULTRIX-32m system if you are already familiar 
with Pascal programming. The article is arranged in tutorial format; it lists 
reference materials, explains how to use an editor to create a Pascal program, and 
gives various execution options. Berkeley Pascal includes six utilities for translating, 
compiling, running, and analyzing programs: 

pi      Translates the source program into interpreter code and stores the
        interpreter code

px      Interprets (executes) the interpreter code created by pi

pix     Translates the source program and then executes it

pc      Processes the source program to compile an executable binary file

pxp     Creates an execution profile for a program when used together with pi or
        pix

pxref   Produces a program listing and a cross-reference identifier from a source
        program


“The Berkeley Pascal User’s Manual” explains how to use these utilities, how to 
handle piping, input, and output, how to interpret error diagnostics, how to include 
source text from several files for the translator, and how to compile separate 
segments of a Pascal program to be linked for running later. An appendix gives a 
precise definition of Berkeley Pascal. 



Fortran 


Two articles that follow describe f77 Fortran. The “Introduction to the f77 I/O 
Library,” by Wasley, lists specifications and rules for using the f77 I/O library 
routines. These routines make use of the standard C I/O library routines in 
ULTRIX-32m. The article explains the different methods available for accessing 
files, rules for use of logical units for I/O, and error and status handling for I/O 
processing. It tells in detail how the standard Fortran commands and concepts are 
implemented on the ULTRIX-32m system. In addition, the article identifies non- 
ANSI standard extensions to the library and shows methods you can use to make 
older Fortran programs compatible with this I/O library. 

“A Portable Fortran 77 Compiler,” by Feldman and Weinberger, describes the rules
and conventions of Fortran 77 as implemented on the ULTRIX-32m system.
William N. Joy, Susan L. Graham, Charles B. Haley,
Marshall Kirk McKusick, and Peter B. Kessler

Computer Science Division 

Department of Electrical Engineering and Computer Science 
University of California, Berkeley 
Berkeley, California 94720 


ABSTRACT 

Berkeley Pascal is designed for interactive instructional use and runs on
the PDP/11 and VAX/11 computers. Interpretive code is produced, providing
fast translation at the expense of slower execution speed. There is also a fully
compatible compiler for the VAX/11. An execution profiler and Wirth’s cross
reference program are also available with the system.

The system supports full Pascal. The language accepted is ‘standard’
Pascal, and a small number of extensions. There is an option to suppress the
extensions. The extensions include a separate compilation facility and the
ability to link to object modules produced from other source languages.

The User's Manual gives a list of sources relating to the UNIX system,
the Pascal language, and the Berkeley Pascal system. Basic usage examples are
provided for the Pascal components pi, px, pix, pc, and pxp. Errors commonly
encountered in these programs are discussed. Details are given of special
considerations due to the interactive implementation. A number of examples are
provided including many dealing with input/output. An appendix supplements
Wirth’s Pascal Report to form the full definition of the Berkeley
implementation of the language.
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Version 3,0 - July 1983 


Introduction 

The Berkeley Pascal User's Manual consists of five major sections and an appendix. In
section 1 we give sources of information about UNIX, about the programming language Pascal,
and about the Berkeley implementation of the language. Section 2 introduces the Berkeley
implementation and provides a number of tutorial examples. Section 3 discusses the error
diagnostics produced by the translators pc and pi, and the runtime interpreter px. Section 4
describes input/output with special attention given to features of the interactive implementation
and to features unique to UNIX. Section 5 gives details on the components of the system
and explanation of all relevant options. The User's Manual concludes with an appendix to
Wirth’s Pascal Report with which it forms a precise definition of the implementation.


History of the implementation 

The first Berkeley system was written by Ken Thompson in early 1976. The main 
features of the present system were implemented by Charles Haley and William Joy during the 
latter half of 1976. Earlier versions of this system have been in use since January, 1977. 

The system was moved to the VAX-11 by Peter Kessler and Kirk McKusick with the porting
of the interpreter in the spring of 1979, and the implementation of the compiler in the
summer of 1980.

1. Sources of information

This section lists the resources available for information about general features of UNIX, 
text editing, the Pascal language, and the Berkeley Pascal implementation, concluding with a 
list of references. The available documents include both so-called standard documents - those
distributed with all UNIX systems - and documents (such as this one) written at Berkeley.

1.1. Where to get documentation 

Current documentation for most of the UNIX system is available “on line” at your terminal.
Details on getting such documentation interactively are given in section 1.2.

1.2. Documentation describing UNIX 

The following documents are those recommended as tutorial and reference material about 
the UNIX system. We give the documents with the introductory and tutorial materials first, 
the reference materials last. 

UNIX For Beginners - Second Edition 

This document is the basic tutorial for UNIX available with the standard system. 

Communicating with UNIX 

This is also a basic tutorial on the system and assumes no previous familiarity with com¬ 
puters; it was written at Berkeley. 

An introduction to the C shell 

This document introduces csh, the shell in common use at Berkeley, and provides a good 
deal of general description about the way in which the system functions. It provides a useful 
glossary of terms used in discussing the system. 

UNIX Programmer’s Manual 

This manual is the major source of details on the components of the UNIX system. It 
consists of an Introduction, a permuted index, and eight command sections. Section 1 consists 
of descriptions of most of the “commands” of UNIX. Most of the other sections have limited
relevance to the user of Berkeley Pascal, being of interest mainly to system programmers. 

UNIX documentation often refers the reader to sections of the manual. Such a reference 
consists of a command name and a section number or name. An example of such a reference 
would be: ed (1). Here ed is a command name - the standard UNIX text editor, and ‘(1)’ indicates
that its documentation is in section 1 of the manual.

The pieces of the Berkeley Pascal system are pi (1), px (1), the combined Pascal translator
and interpretive executor pix (1), the Pascal compiler pc (1), the Pascal execution profiler
pxp (1), and the Pascal cross-reference generator pxref (1).

It is possible to obtain a copy of a manual section by using the man (1) command. To 
get the Pascal documentation just described one could issue the command: 

% man pi 

to the shell. The user input here is shown in bold face; the ‘%’, which was printed by the
shell as a prompt, is not. Similarly the command:

% man man 

asks the man command to describe itself.

1.3. Text editing documents

The following documents introduce the various UNIX text editors. Most Berkeley users 
use a version of the text editor ex; either edit, which is a version of ex for new and casual users, 
ex itself, or vi (visual) which focuses on the display editing portion of ex. 

A Tutorial Introduction to the UNIX Text Editor 

This document, written by Brian Kernighan of Bell Laboratories, is a tutorial for the 
standard UNIX text editor ed. It introduces you to the basics of text editing, and provides 
enough information to meet day-to-day editing needs, for ed users. 

Edit: A tutorial 

This introduces the use of edit, an editor similar to ed which provides a more hospitable 
environment for beginning users. 


Ex/edit Command Summary 

This summarizes the features of the editors ex and edit in a concise form. If you have 
used a line oriented editor before, this summary alone may be enough to get you started.

Ex Reference Manual - Version 3.5 

A complete reference on the features of ex and edit. 

An Introduction to Display Editing with Vi 

Vi is a display oriented text editor. It can be used on most any CRT terminal, and uses
the screen as a window into the file you are editing. Changes you make to the file are reflected
in what you see. This manual serves both as an introduction to editing with vi and a reference
manual.

Vi Quick Reference 

This reference card is a handy quick guide to vi; you should get one when you get the 
introduction to vi. 

1.4. Pascal documents - The language 

This section describes the documents on the Pascal language which are likely to be most
useful to the Berkeley Pascal user. Complete references for these documents are given in
section 1.6.

Pascal User Manual 

By Kathleen Jensen and Niklaus Wirth, the User Manual provides a tutorial introduction
to the features of the language Pascal, and serves as an excellent quick-reference to the
language. The reader with no familiarity with Algol-like languages may prefer one of the Pascal
text books listed below, as they provide more examples and explanation. Particularly
important here are pages 116-118 which define the syntax of the language. Sections 13 and 14
and Appendix F pertain only to the 6000-3.4 implementation of Pascal.

Pascal Report 

By Niklaus Wirth, this document is bound with the User Manual. It is the guiding reference
for implementors and the fundamental definition of the language. Some programmers
find this report too concise to be of practical use, preferring the User Manual as a reference.

Books on Pascal

Several good books which teach Pascal or use it as a medium are available. The books by
Wirth, Systematic Programming and Algorithms + Data Structures = Programs, use Pascal as a
vehicle for teaching programming and data structure concepts respectively. They are both
recommended. Other books on Pascal are listed in the references below.

1.5. Pascal documents — The Berkeley Implementation 

This section describes the documentation which is available describing the Berkeley 
implementation of Pascal. 

User’s Manual 

The document you are reading is the User's Manual for Berkeley Pascal. We often refer 
the reader to the Jensen-Wirth User Manual mentioned above, a different document with a 
similar name. 

Manual sections 

The sections relating to Pascal in the UNIX Programmer's Manual are pix (1), pi (1), pc (1),
px (1), pxp (1), and pxref (1). These sections give a description of each program, summarize
the available options, indicate files used by the program, give basic information on the
diagnostics produced and include a list of known bugs.

Implementation notes 

For those interested in the internal organization of the Berkeley Pascal system there is
a series of Implementation Notes describing these details. The Berkeley Pascal PX Implementation
Notes describe the Pascal interpreter px; and the Berkeley Pascal PXP Implementation
Notes describe the structure of the execution profiler pxp.

1.6. References 
UNIX Documents 

Communicating With UNIX 
Computer Center 
University of California, Berkeley 
January, 1978. 

Edit: a tutorial
Ricki Blau and James Joyce
Computing Services Division, Computing Affairs
University of California, Berkeley
January, 1978.

Ex/edit Command Summary 
Computer Center 
University of California, Berkeley 
August, 1978. 

Ex Reference Manual - Version 3.5
An Introduction to Display Editing with Vi
Vi Quick Reference
William Joy
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
October, 1980.

An Introduction to the C shell (Revised)
William Joy
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
October, 1980.

Brian W. Kernighan
UNIX for Beginners - Second Edition
Bell Laboratories
Murray Hill, New Jersey.

Brian W. Kernighan
A Tutorial Introduction to the UNIX Text Editor
Bell Laboratories
Murray Hill, New Jersey.

Dennis M. Ritchie and Ken Thompson
The UNIX Time Sharing System
Communications of the ACM
July 1974
365-378.

B. W. Kernighan and M. D. McIlroy
UNIX Programmer's Manual - Seventh Edition
Bell Laboratories
Murray Hill, New Jersey
December, 1978.
(Virtual VAX/11 Version,
U. C. Berkeley
Berkeley, Ca.
November, 1980.)

Pascal Language Documents 

Conway, Gries and Zimmerman
A Primer on PASCAL
Winthrop, Cambridge Mass.
1976, 433 pp.

Kathleen Jensen and Niklaus Wirth
Pascal - User Manual and Report
Springer-Verlag, New York.
1975, 167 pp.

C. A. G. Webster
Introduction to Pascal
Heyden and Son, New York.
1976, 129 pp.

Niklaus Wirth
Algorithms + Data Structures = Programs
Prentice-Hall, New York.
1976, 366 pp.

Niklaus Wirth
Systematic Programming
Prentice-Hall, New York.
1973, 169 pp.


Berkeley Pascal documents 

The following documents are available from the Computer Center Library at the University
of California, Berkeley.

William N. Joy, Susan L. Graham, and Charles B. Haley
Berkeley Pascal User's Manual - Version 2.0
October 1980.

William N. Joy
Berkeley Pascal PX Implementation Notes
Version 1.1, April 1979.
(Vax-11 Version 2.0 By Kirk McKusick, December, 1979)

William N. Joy
Berkeley Pascal PXP Implementation Notes
Version 1.1, April 1979.

2. Basic UNIX Pascal

The following sections explain the basics of using Berkeley Pascal. In examples here we
use the text editor ex (1). Users of the text editor ed should have little trouble following these
examples, as ex is similar to ed. We use ex because it allows us to make clearer examples.† The
new UNIX user will find it helpful to read one of the text editor documents described in section
1.3 before continuing with this section.

2.1. A first program 

To prepare a program for Berkeley Pascal we first need to have an account on UNIX and 
to ‘login’ to the system on this account. These procedures are described in the documents 
Communicating with UNIX and UNIX for Beginners. 

Once we are logged in we need to choose a name for our program; let us call it first as 
this is the first example. We must also choose a name for the file in which the program will be 
stored. The Berkeley Pascal system requires that programs reside in files which have names 
ending with the sequence ‘.p’ so we will call our file ‘first.p’. 

A sample editing session to create this file would begin: 

% ex first.p
"first.p" [New file]

We didn’t expect the file to exist, so the error diagnostic doesn’t bother us. The editor now
knows the name of the file we are creating. The ‘:’ prompt indicates that it is ready for command
input. We can add the text for our program using the ‘append’ command as follows.

:append
program first(output)
begin
        writeln('Hello, world!')
end.
.
:

The line containing the single ‘.’ character here indicated the end of the appended text. The ‘:’
prompt indicates that ex is ready for another command. As the editor operates in a temporary
work space we must now store the contents of this work space in the file ‘first.p’ so we can use
the Pascal translator and executor pix on it.

:write
"first.p" [New file] 4 lines, 59 characters
:quit
%

We wrote out the file from the edit buffer here with the ‘write’ command, and ex indicated the
number of lines and characters written. We then quit the editor, and now have a prompt from
the shell.‡


† Users with CRT terminals should find the editor vi more pleasant to use; we do not show its use here
because its display oriented nature makes it difficult to illustrate.
‡ Our examples here assume you are using csh.

We are ready to try to translate and execute our program.

% pix first.p
Wed Oct 29 17:11 1980  first.p:
     2  begin
e ------^--- Inserted ';'
Execution begins...
Hello, world!
Execution terminated.

1 statements executed in 0000.80 seconds cpu time.
%

The translator first printed a syntax error diagnostic. The number 2 here indicates that
the rest of the line is an image of the second line of our program. The translator is saying that
it expected to find a ‘;’ before the keyword begin on this line. If we look at the Pascal syntax
charts in the Jensen-Wirth User Manual, or at some of the sample programs therein, we will
see that we have omitted the terminating ‘;’ of the program statement on the first line of our
program.

One other thing to notice about the error diagnostic is the letter ‘e’ at the beginning. It
stands for ‘error’, indicating that our input was not legal Pascal. The fact that it is an ‘e’
rather than an ‘E’ indicates that the translator managed to recover from this error well enough
that generation of code and execution could take place. Execution is possible whenever no
fatal ‘E’ errors occur during translation. The other classes of diagnostics are ‘w’ warnings,
which do not necessarily indicate errors in the program, but point out inconsistencies which
are likely to be due to program bugs, and ‘s’ standard-Pascal violations.†

After completing the translation of the program to interpretive code, the Pascal system 
indicates that execution of the translated program began. The output from the execution of 
the program then appeared. At program termination, the Pascal runtime system indicated the 
number of statements executed, and the amount of cpu time used, with the resolution of the 
latter being 1/60’th of a second. 

Let us now fix the error in the program and translate it to a permanent object code file
obj using pi. The program pi translates Pascal programs but stores the object code instead of
executing it.‡

% ex first.p
"first.p" 4 lines, 59 characters
:1 print
program first(output)
:s/$/;
program first(output);
:write
"first.p" 4 lines, 60 characters
:quit
% pi first.p
%


†The standard Pascal warnings occur only when the associated s translator option is enabled. The s option
is discussed in sections 5.1 and A.6 below. Warning diagnostics are discussed at the end of section 3.2; the
associated w option is described in section 5.2.

‡This script indicates some other useful approaches to debugging Pascal programs. As in ed we can shorten
commands in ex to an initial prefix of the command name as we did with the substitute command here. We
have also used the ‘!’ shell escape command here to execute other commands with a shell without leaving the
editor.

If we now use the UNIX ls list files command we can see what files we have:

% ls
first.p
obj
%

The file ‘obj’ here contains the Pascal interpreter code. We can execute this by typing: 

% px obj 
Hello, world! 

1 statements executed in 0.02 seconds cpu time. 

% 

Alternatively, the command: 

% obj 

will have the same effect. Some examples of different ways to execute the program follow. 

% px
Hello, world!

1 statements executed in 0.02 seconds cpu time.
% pi -p first.p
% px obj
Hello, world!
% pix -p first.p
Hello, world!
%

Note that px will assume that ‘obj’ is the file we wish to execute if we don’t tell it otherwise.
The last two translations use the -p no-post-mortem option to eliminate execution
statistics and ‘Execution begins’ and ‘Execution terminated’ messages. See section 5.2 for
more details. If we now look at the files in our directory we will see:

% ls
first.p
obj
%

We can give our object program a name other than ‘obj’ by using the move command mv (1). 
Thus to name our program ‘hello’: 

% mv obj hello
% hello
Hello, world!
% ls
first.p
hello
%

Finally we can get rid of the Pascal object code by using the rm (1) remove file command, e.g.: 

% rm hello
% ls
first.p
%


For small programs which are being developed pix tends to be more convenient to use
than pi and px. Except for absence of the obj file after a pix run, a pix command is equivalent
to a pi command followed by a px command. For larger programs, where a number of runs
testing different parts of the program are to be made, pi is useful as this obj file can be
executed any desired number of times.
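
For reference, the file first.p as it stands after the correction made in this section is
sketched below. The only change from the text first typed in is the ‘;’ that ends the program
heading on line 1; the indentation shown here is a matter of taste and is not significant.

```pascal
program first(output);
begin
    (* the single statement counted in the execution statistics *)
    writeln('Hello, world!')
end.
```

Translating this file and running the result should print the single line Hello, world!, as in
the transcripts above.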

2.2. A larger program 

Suppose that we have used the editor to put a larger program in the file ‘bigger.p’. We
can list this program with line numbers by using the program cat -n, i.e.:

% cat -n bigger.p

     1  (*
     2   * Graphic representation of a function
     3   *    f(x) = exp(-x) * sin(2 * pi * x)
     4   *)
     5  program graph1(output);
     6  const
     7          d = 0.0625;   (* 1/16, 16 lines for interval [x, x+1] *)
     8          s = 32;       (* 32 character width for interval [x, x+1]
     9          h = 34;       (* Character position of x-axis *)
    10          c = 6.28138;  (* 2 * pi *)
    11          lim = 32;
    12  var
    13          x, y: real;
    14          i, n: integer;
    15  begin
    16          for i := 0 to lim begin
    17                  x := d / i;
    18                  y := exp(-x9 * sin(i * x);
    19                  n := Round(s * y) + h;
    20                  repeat
    21                          write(' ');
    22                          n := n - 1
    23                  writeln('*')
    24  end.
%

This program is similar to program 4.9 on page 30 of the Jensen-Wirth User Manual. A 
number of problems have been introduced into this example for pedagogical reasons. 

If we attempt to translate and execute the program using pix we get the following 
response: 

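% pix bigger.p
Wed Oct 29 17:11 1980  bigger.p:
     9          h = 34;       (* Character position of x-axis *)
w ----------------------------^--- (* in a (* ... *) comment
    16          for i := 0 to lim begin
e --------------------------------^--- Inserted keyword do
    18                  y := exp(-x9 * sin(i * x);
E --------------------------------^--- Undefined variable
e -----------------------------------------------^--- Inserted ')'
    19                  n := Round(s * y) + h;
E ---------------------------^--- Undefined function
E ------------------------------------------^--- Undefined variable
    23                  writeln('*')
e ----------------------^--- Inserted ';'
    24  end.
E ------^--- Expected keyword until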

E . 

0 . 


Execution suppressed due to compilation errors

% 


Since there were fatal ‘E’ errors in our program, no code was generated and execution 
was necessarily suppressed. One thing which would be useful at this point is a listing of the 
program with the error messages. We can get this by using the command: 

% pi -l bigger.p

There is no point in using pix here, since we know there are fatal errors in the program. This
command will produce the output at our terminal. If we are at a terminal which does not produce
a hard copy we may wish to print this listing off-line on a line printer. We can do this
with the command:

% pi -l bigger.p | lpr

In the next few sections we will illustrate various aspects of the Berkeley Pascal system 
by correcting this program. 

2.3. Correcting the first errors 

Most of the errors which occurred in this program were syntactic errors, those in the format
and structure of the program rather than its content. Syntax errors are flagged by printing
the offending line, and then a line which flags the location at which an error was detected.
The flag line also gives an explanation stating either a possible cause of the error, a simple
action which can be taken to recover from the error so as to be able to continue the analysis, a
symbol which was expected at the point of error, or an indication that the input was ‘malformed’.
In the last case, the recovery may skip ahead in the input to a point where analysis of
the program can continue.

In this example, the first error diagnostic indicates that the translator detected a comment
within a comment. While this is not considered an error in ‘standard’ Pascal, it usually
corresponds to an error in the program which is being translated. In this case, we have
accidentally omitted the trailing ‘*)’ of the comment on line 8. We can begin an editor session
to correct this problem by doing:

% ex bigger.p
"bigger.p" 24 lines, 512 characters
:8s/$/ *)
        s = 32;       (* 32 character width for interval [x, x+1] *)


The second diagnostic, given after line 16, indicates that the keyword do was expected
before the keyword begin in the for statement. If we examine the statement syntax chart on
page 118 of the Jensen-Wirth User Manual we will discover that do is a necessary part of the
for statement. Similarly, we could have referred to section C.3 of the Jensen-Wirth User
Manual to learn about the for statement and gotten the same information there. It is often
useful to refer to these syntax charts and to the relevant sections of this book.

We can correct this problem by first scanning for the keyword for in the file and then
substituting the keyword do to appear in front of the keyword begin there. Thus:

:/for
        for i := 0 to lim begin
:s/begin/do &
        for i := 0 to lim do begin


The next error in the program is easy to pinpoint. On line 18, we didn’t hit the shift key and
got a ‘9’ instead of a ‘)’. The translator diagnosed that ‘x9’ was an undefined variable and,
later, that a ‘)’ was missing in the statement. It should be stressed that pi is not suggesting
that you should insert a ‘)’ before the ‘;’. It is only indicating that making this change will help
it to be able to continue analyzing the program so as to be able to diagnose further errors. You
must then determine the true cause of the error and make the appropriate correction to the
source text.

This error also illustrates the fact that one error in the input may lead to multiple error
diagnostics. Pi attempts to give only one diagnostic for each error, but single errors in the
input sometimes appear to be more than one error. It is also the case that pi may not detect
an error when it occurs, but may detect it later in the input. This would have happened in this
example if we had typed ‘x’ instead of ‘x9’.

The translator next detected, on line 19, that the function Round and the variable h were
undefined. It does not know about Round because Berkeley Pascal normally distinguishes
between upper and lower case.† On UNIX lower-case is preferred‡, and all keywords and built-in
procedure and function names are composed of lower-case letters, just as they are in the
Jensen-Wirth Pascal Report. Thus we need to use the function round here. As far as h is concerned,
we can see why it is undefined if we look back to line 9 and note that its definition was
lost in the non-terminated comment. This diagnostic need not, therefore, concern us.

The next error which occurred in the program caused the translator to insert a ‘;’ before
the statement calling writeln on line 23. If we examine the program around the point of error
we will see that the actual error is that the keyword until and an associated expression have
been omitted here. Note that the diagnostic from the translator does not indicate the actual
error, and is somewhat misleading. The translator made the correction which seemed to be
most plausible. As the omission of a ‘;’ character is a common mistake, the translator chose to
indicate this as a possible fix here. It later detected that the keyword until was missing, but
not until it saw the keyword end on line 24. The combination of these diagnostics indicates to
us the true problem.

The final syntactic error message indicates that the translator needed an end keyword to 
match the begin at line 15. Since the end at line 24 is supposed to match this begin, we can 
infer that another begin must have been mismatched, and have matched this end. Thus we 
see that we need an end to match the begin at line 16, and to appear before the final end. 
We can make these corrections: 


:/x9/s//x)
                y := exp(-x) * sin(i * x);
:+s/Round/round
                n := round(s * y) + h;
:/write
                        write(' ');
:/
                        writeln('*')
:insert
                until n = 0;
.
:$
end.
:insert
        end
.

†In “standard” Pascal no distinction is made based on case.
‡One good reason for using lower-case is that it is easier to type.

At the end of each procedure or function and the end of the program the translator 
summarizes references to undefined variables and improper usages of variables. It also gives 
warnings about potential errors. In our program, the summary errors do not indicate any 
further problems but the warning that c is unused is somewhat suspicious. Examining the pro¬ 
gram we see that the constant was intended to be used in the expression which is an argument 
to sin , so we can correct this expression, and translate the program. We have now made a 
correction for each diagnosed error in our program. 

:?i ?s//c /
                y := exp(-x) * sin(c * x);
:write
"bigger.p" 26 lines, 538 characters
:quit
% pi bigger.p
%
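
The two structural rules at issue in this section, that a for statement requires the keyword
do and that every repeat must be closed by until and an expression, can be seen in isolation
in the small sketch below. The program name loops and its variables are illustrative only and
do not come from the example program.

```pascal
program loops(output);
var
    i, n: integer;
begin
    (* 'do' is a required part of the for statement *)
    for i := 1 to 3 do begin
        n := i;
        (* the statements between repeat and until run at least once *)
        repeat
            write(' ');
            n := n - 1
        until n = 0;
        writeln('*')
    end        (* matches the begin of the for statement *)
end.           (* matches the begin of the program block *)
```

Each pass of the for loop writes i blanks followed by an asterisk. Removing the until line
here should provoke diagnostics of the same kind as those discussed for lines 23 and 24 above.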

It should be noted that the translator suppresses warning diagnostics for a particular procedure,
function or the main program when it finds severe syntax errors in that part of the
source text. This is to prevent possibly confusing and incorrect warning diagnostics from being
produced. Thus these warning diagnostics may not appear in a program with bad syntax
errors until these errors are corrected.

We are now ready to execute our program for the first time. We will do so in the next
section after giving a listing of the corrected program for reference purposes.

% cat -n bigger.p
     1  (*
     2   * Graphic representation of a function
     3   *    f(x) = exp(-x) * sin(2 * pi * x)
     4   *)
     5  program graph1(output);
     6  const
     7          d = 0.0625;   (* 1/16, 16 lines for interval [x, x+1] *)
     8          s = 32;       (* 32 character width for interval [x, x+1] *)
     9          h = 34;       (* Character position of x-axis *)
    10          c = 6.28138;  (* 2 * pi *)
    11          lim = 32;
    12  var
    13          x, y: real;
    14          i, n: integer;
    15  begin
    16          for i := 0 to lim do begin
    17                  x := d / i;
    18                  y := exp(-x) * sin(c * x);
    19                  n := round(s * y) + h;
    20                  repeat
    21                          write(' ');
    22                          n := n - 1
    23                  until n = 0;
    24                  writeln('*')
    25          end
    26  end.




2.4. Executing the second example 

We are now ready to execute the second example. The following output was produced by 
our first run. 

% px
Execution begins...

Floating point division error

        Error in "graph1"+2 near line 17.

Execution terminated abnormally.

2 statements executed in 0.05 seconds cpu time.
%

Here the interpreter is presenting us with a runtime error diagnostic. It detected a ‘division by
zero’ at line 17. Examining line 17, we see that we have written the statement ‘x := d / i’
instead of ‘x := d * i’. We can correct this and rerun the program:

% ex bigger.p
"bigger.p" 26 lines, 538 characters
:17
                x := d / i
:s'/'*
                x := d * i
:write
"bigger.p" 26 lines, 538 characters
:q

% pix bigger.p 

Execution begins... 


* 

* 



* 

* 

Execution terminated. 


2550 statements executed in 0.30 seconds cpu time. 

% 

This appears to be the output we wanted. We could now save the output in a file if we 
wished by using the shell to redirect the output: 

% px > graph 

We can use cat (1) to see the contents of the file graph. We can also make a listing of the
graph on the line printer without putting it into a file, e.g.

% px | lpr
Execution begins...
Execution terminated.

2550 statements executed in 0.37 seconds cpu time.
%

Note here that the statistics lines came out on our terminal. The statistics line comes out on
the diagnostic output (unit 2). There are two ways to get rid of the statistics line. We can
redirect the statistics message to the printer using the syntax ‘|&’ to the shell rather than ‘|’,
i.e.:

% px |& lpr
%

or we can translate the program with the p option disabled on the command line as we did
above. This will disable all post-mortem dumping including the statistics line, thus:

% pi -p bigger.p
% px | lpr
%

This option also disables the statement limit which normally guards against infinite looping.
You should not use it until your program is debugged. Also if p is specified and an error
occurs, you will not get run time diagnostic information to help you determine what the problem
is.
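
For reference, the program as it stands at the end of this section is sketched below. It is
reassembled from the corrected listing in section 2.3 with the single change to line 17 made
above; the constant c is kept exactly as printed in that listing, and the comment on the
changed line is added here for illustration.

```pascal
(*
 * Graphic representation of a function
 *    f(x) = exp(-x) * sin(2 * pi * x)
 *)
program graph1(output);
const
    d = 0.0625;   (* 1/16, 16 lines for interval [x, x+1] *)
    s = 32;       (* 32 character width for interval [x, x+1] *)
    h = 34;       (* Character position of x-axis *)
    c = 6.28138;  (* 2 * pi, value as printed in the listing *)
    lim = 32;
var
    x, y: real;
    i, n: integer;
begin
    for i := 0 to lim do begin
        x := d * i;                   (* was d / i, which divided by zero when i = 0 *)
        y := exp(-x) * sin(c * x);
        n := round(s * y) + h;        (* column at which the asterisk is plotted *)
        repeat
            write(' ');
            n := n - 1
        until n = 0;
        writeln('*')
    end
end.
```

Since y always lies between -1 and 1, n is at least 2 on entry to the repeat loop, so the
loop terminates on every one of the 33 output lines.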

2.5. Formatting the program listing 

It is possible to use special lines within the source text of a program to format the program
listing. An empty line (one with no characters on it) corresponds to a ‘space’ macro in
an assembler, leaving a completely blank line without a line number. A line containing only a
control-L (form-feed) character will cause a page eject in the listing with the corresponding line
number suppressed. This corresponds to an ‘eject’ pseudo-instruction. See also section 5.2 for
details on the n and i options of pi.

2.6. Execution profiling 

An execution profile consists of a structured listing of (all or part of) a program with
information about the number of times each statement in the program was executed for a particular
run of the program. These profiles can be used for several purposes. In a program
which was abnormally terminated due to excessive looping or recursion or by a program fault,
the counts can facilitate location of the error. Zero counts mark portions of the program which
were not executed; during the early debugging stages they should prompt new test data or a
re-examination of the program logic. The profile is perhaps most valuable, however, in drawing
attention to the (typically small) portions of the program that dominate execution time.
This information can be used for source level optimization.

An example 

A prime number is a number which is divisible only by itself and the number one. The
program primes, written by Niklaus Wirth, determines the first few prime numbers. In
translating the program we have specified the z option to pix. This option causes the translator
to generate counters and count instructions sufficient in number to determine the number
of times each statement in the program was executed.† When execution of the program
completes, either normally or abnormally, this count data is written to the file pmon.out in the
current directory.‡ It is then possible to prepare an execution profile by giving pxp the name of
the file associated with this data, as was done in the following example.

†The counts are completely accurate only in the absence of runtime errors and nonlocal goto statements.
This is not generally a problem, however, as in structured programs nonlocal goto statements occur infrequently,
and counts are incorrect after abnormal termination only when the upward look described below to


% pix — 1 — z primes.p 

Berkeley Pascal PI-Version 3.0 (7/26/83) 


program primes(output); 
const n = 50; nl = 7; (*nl = sqrt(n)*) 
var i,k,x,inc,lim,square,1: integer; 
prim: boolean; 
p,v: array[l..nl] of integer; 
begin 

write(2:6, 3:6); 1 := 2; 

x := 1; inc := 4; lim := 1; square := 9; 

for i : = 3 to n do 

begin (*find next prime*) 

repeat x := x 4- inc; inc := 6— inc; 
if square < = x then 
begin lim := lim+1; 

v[lim] := square; square := sqr(p[lim+l]) 
end ; 

k := 2; prim := true; 
while prim and (k<lim) do 
begin k := k-fl; 

if v[k] < x then v[k] := v[k] + 2*p[k]; 
prim := x <> v[k] 
end 

until prim; 

if i <= nl then p[i] := x; 
write(x:6); 1 := 1+1; 
if 1 = 10 then 

begin writeln; 1 : = 0 
end 


Wed Oct 29 17:11 1980 primes.p 

1 
2 

3 

4 

5 

6 

7 

8 
9 

10 
11 
12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 end; 

29 writeln; 

30 end . 


Execution 

2 

begins... 

3 5 

7 

11 

31 

37 

41 

43 

47 

73 

79 

83 

89 

97 

127 

131 

137 

139 

149 

179 

181 

191 

193 

197 


13 

17 

19 

23 

29 

53 

59 

61 

67 

71 

101 

103 

107 

109 

113 

151 

157 

163 

167 

173 

199 

211 

223 

227 

229 


Execution terminated. 

1404 statements executed in 0000.670 seconds cpu time. 

% 


get a count passes a suspended call point. 

J Pmon.out has a name similar to mon.out the monitor file produced by the profiling facility of the C com¬ 
piler cc (1). See prof (1) for a discussion of the C compiler profiling facilities. 
Discussion

The header lines of the outputs of pix and pxp in this example indicate the version of the
translator and execution profiler in use at the time this example was prepared. The time given
with the file name (also on the header line) indicates the time of last modification of the
program source file. This time serves to version stamp the input program. Pxp also indicates the
time at which the profile data was gathered.

% pxp -z primes.p
Berkeley Pascal PXP -- Version 2.12 (5/11/83)

Wed Oct 29 17:11 1980  primes.p

Profiled Tue Aug 28 01:17 1984

     1        1.---|program primes(output);
     2             |const
     2             |    n = 50;
     2             |    n1 = 7; (*n1 = sqrt(n)*)
     3             |var
     3             |    i, k, x, inc, lim, square, l: integer;
     4             |    prim: boolean;
     5             |    p, v: array [1..n1] of integer;
     6             |begin
     7             |    write(2: 6, 3: 6);
     7             |    l := 2;
     8             |    x := 1;
     8             |    inc := 4;
     8             |    lim := 1;
     8             |    square := 9;
     9             |    for i := 3 to n do begin (*find next prime*)
     9       48.---|        repeat
    11       76.---|            x := x + inc;
    11             |            inc := 6 - inc;
    12             |            if square <= x then begin
    13        5.---|                lim := lim + 1;
    14             |                v[lim] := square;
    14             |                square := sqr(p[lim + 1])
    14             |            end;
    16             |            k := 2;
    16             |            prim := true;
    17             |            while prim and (k < lim) do begin
    18      157.---|                k := k + 1;
    19             |                if v[k] < x then
    19       42.---|                    v[k] := v[k] + 2 * p[k];
    20             |                prim := x <> v[k]
    20             |            end
    20             |        until prim;
    23             |        if i <= n1 then
    23        5.---|            p[i] := x;
    24             |        write(x: 6);
    24             |        l := l + 1;
    25             |        if l = 10 then begin
    26        5.---|            writeln;
    26             |            l := 0
    26             |        end
    26             |    end;
    29             |    writeln
    29             |end.
%


To determine the number of times a statement was executed, one looks to the left of the
statement and finds the corresponding vertical bar ‘|’. If this vertical bar is labelled with a
count then that count gives the number of times the statement was executed. If the bar is not
labelled, we look up in the listing to find the first ‘|’ directly above the original one which
has a count, and that is the answer. Thus, in our example, k was incremented 157 times on
line 18, while the write procedure call on line 24 was executed 48 times as given by the count
on the repeat.

More information on pxp can be found in its manual section pxp(1) and in sections 5.4,
5.5 and 5.10.
3. Error diagnostics

This section of the User's Manual discusses the error diagnostics of the programs pi, pc
and px. Pix is a simple but useful program which invokes pi and px to do all the real
processing. See its manual section pix(1) and section 5.2 below for more details. All the diagnostics
given by pi will also be given by pc.

3.1. Translator syntax errors 

A few comments on the general nature of the syntax errors usually made by Pascal
programmers and the recovery mechanisms of the current translator may help in using the
system.

Illegal characters 

Characters such as ‘$’, ‘!’, and ‘@’ are not part of the language Pascal. If they are found
in the source program, and are not part of a constant string, a constant character, or a
comment, they are considered to be ‘illegal characters’. This can happen if you leave off an
opening string quote ‘'’. Note that the character ‘"’, although used in English to quote strings, is
not used to quote strings in Pascal. Most non-printing characters in your input are also illegal
except in character constants and character strings. Except for the tab and form feed
characters, which are used to ease formatting of the program, non-printing characters in the input file
print as the character ‘?’ so that they will show in your listing.

String errors 

There is no character string of length 0 in Pascal. Consequently the input ‘''’ is not
acceptable. Similarly, encountering an end-of-line after an opening string quote ‘'’ without
encountering the matching closing quote yields the diagnostic “Unmatched ' for string”. It is
permissible to use the character ‘#’ instead of ‘'’ to delimit character and constant strings for
portability reasons. For this reason, a spuriously placed ‘#’ sometimes causes the diagnostic
about unbalanced quotes. Similarly, a ‘#’ in column one is used when preparing programs
which are to be kept in multiple files. See section 5.11 for details.

Comments in a comment, non-terminated comments 

As we saw above, these errors are usually caused by leaving off a comment delimiter. You
can convert parts of your program to comments without generating this diagnostic since there
are two different kinds of comments: those delimited by ‘{’ and ‘}’, and those delimited by ‘(*’
and ‘*)’. Thus consider:

{ This is a comment enclosing a piece of program
a := functioncall;      (* comment within comment *)
procedurecall;
lhs := rhs;             (* another comment *)
}

By using one kind of comment exclusively in your program you can use the other
delimiters when you need to “comment out” parts of your program.† In this way you will also allow
the translator to help by detecting statements accidentally placed within comments.

If a comment does not terminate before the end of the input file, the translator will point
to the beginning of the comment, indicating that the comment is not terminated. In this case
processing will terminate immediately. See the discussion of “QUIT” below.

† If you wish to transport your program, especially to the 6000-3.4 implementation, you should
use the character sequence ‘(*’ to delimit comments. For transportation over the rcslink to
Pascal 6000-3.4, the character ‘#’ should be used to delimit characters and constant strings.
Digits in numbers

This part of the language is a minor nuisance. Pascal requires digits in real numbers
both before and after the decimal point. Thus the following statements, which look quite
reasonable to FORTRAN users, generate diagnostics in Pascal:

Wed Oct 29 17:11 1980  digits.p:

     4          r := 0.;
e ---------------------^--- Digits required after decimal point
     5          r := .0;
e -------------------^--- Digits required before decimal point
     6          r := 1.e10;
e ---------------------^--- Digits required after decimal point
     7          r := .05e-10;
e -------------------^--- Digits required before decimal point


These same constructs are also illegal as input to the Pascal interpreter px. 
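
As a minimal sketch of the fix, the same four assignments can be rewritten so that every real
constant has digits on both sides of the decimal point. The program name okdigits and the
declaration of r are assumptions made for illustration; they are not taken from digits.p.

```pascal
program okdigits(output);
var
    r: real;
begin
    r := 0.0;        { instead of 0. }
    r := 0.0;        { instead of .0 }
    r := 1.0e10;     { instead of 1.e10 }
    r := 0.05e-10;   { instead of .05e-10 }
    writeln(r)
end.
```

The same rule applies to data typed in response to a read of a real variable.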

Replacements, insertions, and deletions 

When a syntax error is encountered in the input text, the parser invokes an error
recovery procedure. This procedure examines the input text immediately after the point of
error and considers a set of simple corrections to see whether they will allow the analysis to
continue. These corrections involve replacing an input token with a different token, inserting a
token, or deleting an input token. Most of these changes will not cause
fatal syntax errors. The exception is the insertion of or replacement with a symbol such as an
identifier or a number; in this case the recovery makes no attempt to determine which
identifier or what number should be inserted, hence these are considered fatal syntax errors.

Consider the following example. 

% pix -l synerr.p
Berkeley Pascal PI -- Version 3.0 (7/26/83)

Wed Oct 29 17:11 1980  synerr.p

     1  program syn(output);
     2  var i, j are integer;
e ---------------^--- Replaced identifier with a ':'
     3  begin
     4          for j :* 1 to 20 begin
e -----------------------^--- Replaced '*' with a '='
e -------------------------------^--- Inserted keyword do
     5                  write(j);
     6                  i = 2 ** j;
e --------------------------^--- Inserted ':'
E ---------------------------------^--- Inserted identifier
     7                  writeln(i))
E ----------------------------------^--- Deleted ')'
     8          end
     9  end.
%

The only surprise here may be that Pascal does not have an exponentiation operator, hence the
complaint about ‘**’. This error illustrates that, if you assume that the language has a feature
which it does not, the translator diagnostic may not indicate this, as the translator is unlikely 
to recognize the construct you supply. 
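
A corrected version of this program might look like the following sketch. Since there is no
exponentiation operator, the power of two is built up by repeated doubling; this is one
possible rewrite rather than the authors' own, and it assumes that integers are wide enough
to hold 1048576 (2 to the power 20).

```pascal
program syn(output);
var
    i, j: integer;
begin
    i := 1;
    for j := 1 to 20 do begin
        write(j);
        i := 2 * i;      { i now holds 2 to the power j }
        writeln(i)
    end
end.
```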
Undefined or improper identifiers

If an identifier is encountered in the input but is undefined, the error recovery will
replace it with an identifier of the appropriate class. Further references to this identifier will
be summarized at the end of the containing procedure or function or at the end of the
program if the reference occurred in the main program. Similarly, if an identifier is used in an
inappropriate way, e.g. if a type identifier is used in an assignment statement, or if a simple
variable is used where a record variable is required, a diagnostic will be produced and an
identifier of the appropriate type inserted. Further incorrect references to this identifier will be
flagged only if they involve incorrect use in a different way, with all incorrect uses being
summarized in the same way as undefined variable uses are.

Expected symbols, malformed constructs 

If none of the above mentioned corrections appear reasonable, the error recovery will 
examine the input to the left of the point of error to see if there is only one symbol which can 
follow this input. If this is the case, the recovery will print a diagnostic which indicates that 
the given symbol was ‘Expected’. 

In cases where none of these corrections resolve the problems in the input, the recovery
may issue a diagnostic that indicates that the input is “malformed”. If necessary, the
translator may then skip forward in the input to a place where analysis can continue. This process
may cause some errors in the text to be missed.

Consider the following example: 

% pix -l synerr2.p
Berkeley Pascal PI -- Version 3.0 (7/26/83)

Wed Oct 29 17:11 1980  synerr2.p

     1  program synerr2(input,outpu);
     2  integer a(10)
E ------^--- Malformed declaration
     3  begin
     4          read(b);
E ---------------------^--- Undefined variable
     5          for c := 1 to 10 do
E --------------------^--- Undefined variable
     6                  a(c) := b * c;
E ------------------------^--- Undefined procedure
E -----------------------------^--- Malformed statement
     7  end.
E 1 - File outpu listed in program statement but not declared
In program synerr2:
  E - a undefined on line 6
  E - b undefined on line 4
  E - c undefined on lines 5 6
Execution suppressed due to compilation errors
%

Here we misspelled output and gave a FORTRAN style variable declaration which the translator 
diagnosed as a ‘Malformed declaration’. When, on line 6, we used ‘(’ and ‘)’ for subscripting (as 
in FORTRAN) rather than the ‘[’ and ‘]’ which are used in Pascal, the translator noted that a 
was not defined as a procedure. This occurred because procedure and function argument 
lists are delimited by parentheses in Pascal. As it is not permissible to assign to procedure 
calls the translator diagnosed a malformed statement at the point of assignment. 
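
For comparison, here is a sketch of how the same program could be written so that it
translates cleanly. The text does not say what types were intended for b and c, so declaring
them as integer is an assumption.

```pascal
program synerr2(input, output);
var
    a: array [1..10] of integer;   { Pascal form of the FORTRAN-style "integer a(10)" }
    b, c: integer;
begin
    read(b);
    for c := 1 to 10 do
        a[c] := b * c              { square brackets, not parentheses, for subscripts }
end.
```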
Expected and unexpected end-of-file, “QUIT”

If the translator finds a complete program, but there is more non-comment text in the
input file, then it will indicate that an end-of-file was expected. This situation may occur after
a bracketing error, or if too many ends are present in the input. The message may appear
after the recovery says that it “Expected ‘.’” since ‘.’ is the symbol that terminates a program.

If severe errors in the input prohibit further processing the translator may produce a
diagnostic followed by “QUIT”. One example of this was given above, a non-terminated
comment; another example is a line which is longer than 160 characters. Consider also the
following example.

% pix -l mism.p
Berkeley Pascal PI -- Version 3.0 (7/26/83)

Wed Oct 29 17:11 1980  mism.p

     1  program mismatch(output)
     2  begin
e --^--- Inserted ';'
     3          writeln('***');
     4          { The next line is the last line in the file }
     5          writeln
E ---------------------^--- Unexpected end-of-file - QUIT
%


3.2. Translator semantic errors 

The extremely large number of semantic diagnostic messages which the translator
produces makes it unreasonable to discuss each message or group of messages in detail. The
messages are, however, very informative. We will here explain the typical formats and the
terminology used in the error messages so that you will be able to make sense out of them. In any
case in which a diagnostic is not completely comprehensible you can refer to the User Manual
by Jensen and Wirth for examples.

Format of the error diagnostics 

As we saw in the example program above, the error diagnostics from the Pascal translator
include the number of a line in the text of the program as well as the text of the error message.
While this number is most often the line where the error occurred, it is occasionally the
number of a line containing a bracketing keyword like end or until. In this case, the
diagnostic may refer to the previous statement. This occurs because of the method the translator uses
for sampling line numbers. The absence of a trailing ‘;’ in the previous statement causes the
line number corresponding to the end or until to become associated with the statement. As
Pascal is a free-format language, the line number associations can only be approximate and
may seem arbitrary to some users. This is the only notable exception, however, to reasonable
associations.

Incompatible types 

Since Pascal is a strongly typed language, many semantic errors manifest themselves as
type errors. These are called ‘type clashes’ by the translator. The types allowed for various
operators in the language are summarized on page 108 of the Jensen-Wirth User Manual. It is
important to know that the Pascal translator, in its diagnostics, distinguishes between the
following type ‘classes’:

array       Boolean     char        file        integer
pointer     real        record      scalar      string

These words are plugged into a great number of error messages. Thus, if you tried to assign an 
integer value to a char variable you would receive a diagnostic like the following: 

Tue Oct 14 21:37 1980  clash.p:
E 7 - Type clash: integer is incompatible with char
  ... Type of expression clashed with type of variable in assignment

In this case, one error produced a two line error message. If the same error occurs more than 
once, the same explanatory diagnostic will be given each time. 

Scalar 

The only class whose meaning is not self-explanatory is ‘scalar’. Scalar has a precise
meaning in the Jensen-Wirth User Manual where, in fact, it refers to char, integer, real, and
Boolean types as well as the enumerated types. For the purposes of the Pascal translator,
scalar in an error message refers to a user-defined, enumerated type, such as ops in the
example above or color in

type color = (red, green, blue)

For integers, the more explicit denotation integer is used. Although it would be correct, in the
context of the User Manual, to refer to an integer variable as a scalar variable, pi prefers the
more specific identification.

Function and procedure type errors 

For built-in procedures and functions, two kinds of errors occur. If the routines are
called with the wrong number of arguments a message similar to:

Tue Oct 14 21:38 1980  sin1.p:
E 12 - sin takes exactly one argument

is given. If the type of the argument is wrong, a message like

Tue Oct 14 21:38 1980  sin2.p:
E 12 - sin's argument must be integer or real, not char

is produced. A few functions and procedures implemented in Pascal 6000-3.4 are diagnosed as
unimplemented in Berkeley Pascal, notably those related to segmented files.

Can’t read and write scalars, etc. 

The messages which state that scalar (user-defined) types cannot be read from or written to
files are often mysterious. It is in fact the case that if you define

type color = (red, green, blue)

“standard” Pascal does not associate these constants with the strings ‘red’, ‘green’, and ‘blue’ in
any way. An extension has been added which allows enumerated types to be read and written;
however, if the program is to be portable, you will have to write your own routines to perform
these functions. Standard Pascal only allows the reading of characters, integers and real
numbers from text files. You cannot read strings or Booleans. It is possible to make a

file of color 

but the representation is binary rather than string. 
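
A portable output routine of the kind mentioned above can be sketched with a case statement.
The names colors and writecolor are hypothetical, chosen only for this illustration; the
routine uses nothing beyond standard Pascal.

```pascal
program colors(output);
type
    color = (red, green, blue);
var
    c: color;

{ Write the name of an enumerated value as text. }
procedure writecolor(c: color);
begin
    case c of
        red:   write('red');
        green: write('green');
        blue:  write('blue')
    end
end;

begin
    for c := red to blue do begin
        writecolor(c);
        writeln
    end
end.
```

A matching input routine would have to read characters and compare them against the three
names itself, since standard Pascal cannot read strings from text files.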
Expression diagnostics

The diagnostics for semantically ill-formed expressions are very explicit. Consider this
sample translation:

% pi -l expr.p
Berkeley Pascal PI -- Version 2.0 (Sat Oct 18 21:01:54 1980)

Tue Oct 14 21:37 1980  expr.p

     1  program x(output);
     2  var
     3          a: set of char;
     4          b: Boolean;
     5          c: (red, green, blue);
     6          p: ^integer;
     7          A: alfa;
     8          B: packed array [1..5] of char;
     9  begin
    10          b := true;
    11          c := red;
    12          new(p);
    13          a := [];
    14          A := 'Hello, yellow';
    15          b := a and b;
    16          a := a * 3;
    17          if input < 2 then writeln('boo');
    18          if p <= 2 then writeln('sure nuff');
    19          if A = B then writeln('same');
    20          if c = true then writeln('hue''s and color''s')
    21  end.
E 14 - Constant string too long
E 15 - Left operand of and must be Boolean, not set
E 16 - Cannot mix sets with integers and reals as operands of *
E 17 - files may not participate in comparisons
E 18 - pointers and integers cannot be compared - operator was <=
E 19 - Strings not same length in = comparison
E 20 - scalars and Booleans cannot be compared - operator was =
In program x:
  w - constant green is never used
  w - constant blue is never used
  w - variable B is used but never set
%

This example is admittedly far-fetched, but illustrates that the error messages are sufficiently 
clear to allow easy determination of the problem in the expressions. 

Type equivalence 

Several diagnostics produced by the Pascal translator complain about ‘non-equivalent
types’. In general, Berkeley Pascal considers variables to have the same type only if they were
declared with the same constructed type or with the same type identifier. Thus, the variables x
and y declared as

var
    x: ^integer;
    y: ^integer;

do not have the same type. The assignment

x := y

thus produces the diagnostics:

Tue Oct 14 21:38 1980  typequ.p:
E 7 - Type clash: non-identical pointer types
  ... Type of expression clashed with type of variable in assignment

Thus it is always necessary to declare a type such as

type intptr = ^integer;

and use it to declare

var x: intptr; y: intptr;

Note that if we had initially declared

var x, y: ^integer;

then the assignment statement would have worked. The statement

x^ := y^

is allowed in either case. Since the parameter to a procedure or function must be declared 
with a type identifier rather than a constructed type, it is always necessary, in practice, to 
declare any type which will be used in this way. 
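
Putting these pieces together, a complete program might look like the sketch below. The type
name intptr comes from the text; the procedure show and the value 17 are illustrative
assumptions.

```pascal
program typequ(output);
type
    intptr = ^integer;
var
    x: intptr;
    y: intptr;

{ The parameter is declared with the type identifier, not with ^integer. }
procedure show(p: intptr);
begin
    writeln(p^)
end;

begin
    new(y);
    y^ := 17;
    x := y;        { allowed: x and y share the type identifier intptr }
    show(x)
end.
```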

Unreachable statements 

Berkeley Pascal flags unreachable statements. Such statements usually correspond to 
errors in the program logic. Note that a statement is considered to be reachable if there is a 
potential path of control, even if it can never be taken. Thus, no diagnostic is produced for the 
statement: 

if false then 

writeln('impossible!') 
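
By contrast, a statement with no path of control leading to it at all is the kind one would
expect to be flagged. The following is a hypothetical sketch (the program name and label are
invented for illustration); the text does not show the exact wording of the diagnostic, so
none is reproduced here.

```pascal
program unreach(output);
label
    1;
begin
    goto 1;
    writeln('never printed');    { no path of control reaches this statement }
1:
    writeln('done')
end.
```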

Goto’s into structured statements 

The translator detects and complains about goto statements which transfer control into
structured statements (for, while, etc.). It does not allow such jumps, nor does it allow
branching from the then part of an if statement into the else part. Such checks are made only
within the body of a single procedure or function.

Unused variables, never set variables 

Although pi always clears variables to 0 at procedure and function entry, pc does not 
unless runtime checking is enabled using the C option. It is not good programming practice to 
rely on this initialization. To discourage this practice, and to help detect errors in program 
logic, pi flags as a ‘w’ warning error: 

1) Use of a variable which is never assigned a value. 

2) A variable which is declared but never used, distinguishing between those variables 
for which values are computed but which are never used, and those completely 
unused. 
In fact, these diagnostics are applied to all declared items. Thus a const or a procedure
which is declared but never used is flagged. The w option of pi may be used to suppress these 
warnings; see sections 5.1 and 5.2. 
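
The following hypothetical program contains one declaration of each kind described above. It
is a sketch intended only to show which declarations would draw a ‘w’ warning; all names are
invented for illustration and the exact wording of each message is not taken from the text.

```pascal
program unused(output);
const
    limit = 10;          { declared but never used }
var
    total: integer;      { used below but never assigned a value }
    count: integer;      { assigned a value that is never used }
    spare: integer;      { completely unused }
begin
    count := 1;
    writeln(total)
end.
```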

3.3. Translator panics, i/o errors

Panics

One class of error which rarely occurs, but which causes termination of all processing
when it does is a panic. A panic indicates a translator-detected internal inconsistency. A
typical panic message is:

snark (rvalue) line = 110 yyline = 109 
Snark in pi 

If you receive such a message, the translation will be quickly and perhaps ungracefully
terminated. You should contact a teaching assistant or a member of the system staff, after saving
a copy of your program for later inspection. If you were making changes to an existing
program when the problem occurred, you may be able to work around the problem by ascertaining
which change caused the snark and making a different change or correcting an error in the
program. A small number of panics are possible in px. All panics should be reported to a
teaching assistant or systems staff so that they can be fixed.

Out of memory 

The only other error which will abort translation when no errors are detected is running
out of memory. All tables in the translator, with the exception of the parse stack, are
dynamically allocated, and can grow to take up the full available process space of 64000 bytes on the
PDP-11. On the VAX-11, table sizes are extremely generous and very large (25000 line)
programs have been easily accommodated. For the PDP-11, it is generally true that the size of the
largest translatable program is directly related to procedure and function size. A number of
non-trivial Pascal programs, including some with more than 2000 lines and 2500 statements,
have been translated and interpreted using Berkeley Pascal on PDP-11's. Notable among these
are the Pascal-S interpreter, a large set of programs for automated generation of code
generators, and a general context-free parsing program which has been used to parse sentences with a
grammar for a superset of English. In general, very large programs should be translated using
pc and the separate compilation facility.

If you receive an out of space message from the translator during translation of a large
procedure or function or one containing a large number of string constants you may yet be
able to translate your program if you break this one procedure or function into several
routines.

I/O errors

Other errors which you may encounter when running pi relate to input-output. If pi
cannot open the file you specify, or if the file is empty, you will be so informed.

3.4. Run-time errors 

We saw, in our second example, a run-time error. We here give the general description of 
run-time errors. The more unusual interpreter error messages are explained briefly in the 
manual section for px (1). 

Start-up errors 

These errors occur when the object file to be executed is not available or appropriate. 
Typical errors here are caused by the specified object file not existing, not being a Pascal 
object, or being inaccessible to the user. 
Program execution errors

These errors occur when the program interacts with the Pascal runtime environment in
an inappropriate way. Typical errors are values or subscripts out of range, bad arguments to
built-in functions, exceeding the statement limit because of an infinite loop, or running out of
memory.‡ The interpreter will produce a backtrace after the error occurs, showing all the
active routine calls, unless the p option was disabled when the program was translated.
Unfortunately, no variable values are given and no way of extracting them is available.*

As an example of such an error, assume that we have accidentally declared the constant
n1 to be 6, instead of 7 on line 2 of the program primes as given in section 2.6 above. If we
run this program we get the following response.

% pix primes.p
Execution begins...
     2     3     5     7    11    13    17    19    23    29
    31    37    41    43    47    53    59    61    67    71
    73    79    83    89    97   101   103   107   109   113
   127   131   137   139   149   151   157   163   167

Subscript value of 7 is out of range
Program error

Do you wish to enter the debugger?
Entering debugger ... type 'help' for help.
%

Here the interpreter indicates that the program terminated abnormally due to a subscript
out of range near line 14, which is eight lines into the body of the program primes.

Interrupts

If the program is interrupted while executing and the p option was not specified, then a
backtrace will be printed.† The file pmon.out of profile information will be written if the
program was translated with the z option enabled to pi or pix.

I/O interaction errors

The final class of interpreter errors results from inappropriate interactions with files,
including the user’s terminal. Included here are bad formats for integer and real numbers
(such as no digits after the decimal point) when reading.

‡ The checks for running out of memory are not foolproof and there is a chance that the
interpreter will fault, producing a core image when it runs out of memory. This situation occurs
very rarely.

* On the VAX-11, each variable is restricted to allocate at most 65000 bytes of storage (this is a
PDP-11ism that has survived to the VAX).

† Occasionally, the Pascal system will be in an inconsistent state when this occurs, e.g. when an
interrupt terminates a procedure or function entry or exit. In this case, the backtrace will only
contain the current line. A reverse call order list of procedures will not be given.
4. Input/output

This section describes features of the Pascal input/output environment, with special
consideration of the features peculiar to an interactive implementation.

4.1. Introduction

Our first sample programs, in section 2, used the file output. We gave examples there of
redirecting the output to a file and to the line printer using the shell. Similarly, we can read
the input from a file or another program. Consider the following Pascal program which is
similar to the program cat(1).

% pix -l kat.p <primes
Berkeley Pascal PI -- Version 3.0 (7/26/83)

Wed Oct 29 17:11 1980  kat.p

     1  program kat(input, output);
     2  var
     3      ch: char;
     4  begin
     5      while not eof do begin
     6          while not eoln do begin
     7              read(ch);
     8              write(ch)
     9          end;
    10          readln;
    11          writeln
    12      end
    13  end { kat }.
Execution begins...
     2     3     5     7    11    13    17    19    23    29
    31    37    41    43    47    53    59    61    67    71
    73    79    83    89    97   101   103   107   109   113
   127   131   137   139   149   151   157   163   167   173
   179   181   191   193   197   199   211   223   227   229

Execution terminated.

925 statements executed in 0000.680 seconds cpu time.
%

Here we have used the shell’s syntax to redirect the program input from a file primes in
which we had placed the output of our prime number program of section 2.6. It is also possible
to ‘pipe’ input to this program much as we piped input to the line printer daemon lpr(1)
before. Thus, the same output as above would be produced by

% cat primes | pix -l kat.p

All of these examples use the shell to control the input and output from files. One very
simple way to associate Pascal files with named UNIX files is to place the file name in the
program statement. For example, suppose we have previously created the file data. We then use
it as input to another version of a listing program.

% cat data
line one.
line two.
line three is the end.
% pix -l copydata.p
Berkeley Pascal PI -- Version 3.0 (7/26/83)

Wed Oct 29 17:11 1980  copydata.p

     1  program copydata(data, output);
     2  var
     3      ch: char;
     4      data: text;
     5  begin
     6      reset(data);
     7      while not eof(data) do begin
     8          while not eoln(data) do begin
     9              read(data, ch);
    10              write(ch)
    11          end;
    12          readln(data);
    13          writeln
    14      end
    15  end { copydata }.
Execution begins...
line one.
line two.
line three is the end.

Execution terminated.

134 statements executed in 0000.110 seconds cpu time.
%

By mentioning the file data in the program statement, we have indicated that we wish it to 
correspond to the UNIX file data. Then, when we ‘reset(data)’, the Pascal system opens our 
file ‘data’ for reading. More sophisticated, but less portable, examples of using UNIX files will 
be given in sections 4.5 and 4.6. There is a portability problem even with this simple example. 
Some Pascal systems attach meaning to the ordering of the file in the program statement file 
list. Berkeley Pascal does not do so. 
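
The same mechanism should work in the other direction. As a sketch only, the following
hypothetical program (mkdata is not part of the original examples) creates the file data by
naming it in the program statement and opening it with the standard procedure rewrite; that
the association with the UNIX file data holds for output exactly as it does for input is an
assumption.

```pascal
program mkdata(data);
var
    data: text;
begin
    rewrite(data);                           { open the file named in the program statement for writing }
    writeln(data, 'line one.');
    writeln(data, 'line two.');
    writeln(data, 'line three is the end.')
end.
```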

4.2. Eof and eoln 

An extremely common problem encountered by new users of Pascal, especially in the 
interactive environment offered by UNIX, relates to the definitions of eof and eoln. These 
functions are supposed to be defined at the beginning of execution of a Pascal program, indicating
whether the input device is at the end of a line or the end of a file. Setting eof or eoln
actually corresponds to an implicit read in which the input is inspected, but no input is “used
up”. In fact, there is no way the system can know whether the input is at the end-of-file or the
end-of-line unless it attempts to read a line from it. If the input is from a previously created
file, then this reading can take place without run-time action by the user. However, if the
input is from a terminal, then the input is what the user types.† If the system were to do an
initial read automatically at the beginning of program execution, and if the input were a terminal,
the user would have to type some input before execution could begin. This would make it
impossible for the program to begin by prompting for input or printing a herald.


†It is not possible to determine whether the input is a terminal, as the input may appear to be a file but
actually be a pipe, the output of a program which is reading from the terminal.

Berkeley Pascal has been designed so that an initial read is not necessary. At any given
time, the Pascal system may or may not know whether the end-of-file or end-of-line conditions 
are true. Thus, internally, these functions can have three values - true, false, and “I don’t 
know yet; if you ask me I’ll have to find out”. All files remain in this last, indeterminate state 
until the Pascal program requires a value for eof or eoln either explicitly or implicitly, e.g. in a 
call to read. The important point to note here is that if you force the Pascal system to determine
whether the input is at the end-of-file or the end-of-line, it will be necessary for it to
attempt to read from the input. 

Thus consider the following example code 

while not eof do begin
    write('number, please? ');
    read(i);
    writeln('that was a ', i: 2)
end

At first glance, this may appear to be a correct program for requesting, reading and echoing
numbers. Notice, however, that the while loop asks whether eof is true before the request is
printed. This will force the Pascal system to decide whether the input is at the end-of-file.
The Pascal system will give no messages; it will simply wait for the user to type a line. By
producing the desired prompting before testing eof, the following code avoids this problem:

write('number, please? ');
while not eof do begin
    read(i);
    writeln('that was a ', i: 2);
    write('number, please? ')
end

The user must still type a line before the while test is completed, but the prompt will ask for 
it. This example, however, is still not correct. To understand why, it is first necessary to 
know, as we will discuss below, that there is a blank character at the end of each line in a Pascal
text file. The read procedure, when reading integers or real numbers, is defined so that, if
there are only blanks left in the file, it will return a zero value and set the end-of-file condition. 
If, however, there is a number remaining in the file, the end-of-file condition will not be set 
even if it is the last number, as read never reads the blanks after the number, and there is 
always at least one blank. Thus the modified code will still put out a spurious 

that was a 0 

at the end of a session with it when the end-of-file is reached. The simplest way to correct the 
problem in this example is to use the procedure readln instead of read here. In general, unless 
we test the end-of-file condition both before and after calls to read or readln, there will be
inputs for which our program will attempt to read past end-of-file. 
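
The correction suggested above, replacing read by readln, can be sketched as a complete program. This is an illustrative sketch only: the program name echo is hypothetical, and the code assumes that the user types exactly one number on each line.

```pascal
program echo(input, output);
{ Hypothetical example: prompt, read and echo one integer per line. }
var
    i: integer;
begin
    write('number, please? ');
    while not eof do begin
        readln(i);               { read the number, then discard the rest of the line }
        writeln('that was a ', i: 2);
        write('number, please? ')
    end
end.
```

Because readln consumes the blank at the end of the line along with the number, the next test of eof looks at the start of the following line, and the spurious zero is not produced when the end-of-file is typed.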

4.3. More about eoln 

To have a good understanding of when eoln will be true it is necessary to know that in 
any file there is a special character indicating end-of-line, and that, in effect, the Pascal system 
always reads one character ahead of the Pascal read commands.† For instance, in response to
‘read(ch)’, the system sets ch to the current input character and gets the next input character. 
If the current input character is the last character of the line, then the next input character 
from the file is the new-line character, the normal UNIX line separator. When the read routine 
gets the new-line character, it replaces that character by a blank (causing every line to end
with a blank) and sets eoln to true.

†In Pascal terms, ‘read(ch)’ corresponds to ‘ch := input^; get(input)’

Eoln will be true as soon as we read the last character of
the line and before we read the blank character corresponding to the end of line. Thus it is 
almost always a mistake to write a program which deals with input in the following way: 

read(ch);
if eoln then
    Done with line
else
    Normal processing

as this will almost surely have the effect of ignoring the last character in the line. The 
‘read(ch)’ belongs as part of the normal processing. 

Given this framework, it is not hard to explain the function of a readln call, which is 
defined as: 

while not eoln do
    get(input);
get(input);

This advances the file until the blank corresponding to the end-of-line is the current input 
symbol and then discards this blank. The next character available from read will therefore be 
the first character of the next line, if one exists. 
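
The recommended ordering, in which eoln is tested before each read(ch) and readln is called once at the end of each line, can be sketched as follows. The program name linelen is hypothetical; the example prints the number of characters on each input line and is offered as an illustration rather than as part of the system.

```pascal
program linelen(input, output);
{ Hypothetical example: print the length of each input line. }
var
    ch: char;
    n: integer;
begin
    while not eof do begin
        n := 0;
        while not eoln do begin
            read(ch);            { the read is part of the normal processing }
            n := n + 1
        end;
        readln;                  { discard the blank that ends the line }
        writeln(n: 1)
    end
end.
```

Since eoln is tested before the read rather than after it, the last character of each line is counted and the end-of-line blank is not.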

4.4. Output buffering 

A final point about Pascal input-output must be noted here. This concerns the buffering 
of the file output. It is extremely inefficient for the Pascal system to send each character to 
the user’s terminal as the program generates it for output; even less efficient if the output is 
the input of another program such as the line printer daemon lpr(1). To gain efficiency, the
Pascal system “buffers” the output characters (i.e. it saves them in memory until the buffer is 
full and then emits the entire buffer in one system interaction.) However, to allow interactive 
prompting to work as in the example given above, this prompt must be printed before the Pascal
system waits for a response. For this reason, Pascal normally prints all the output which
has been generated for the file output whenever 

1) A writeln occurs, or 

2) The program reads from the terminal, or 

3) The procedure message or flush is called. 

Thus, in the code sequence 

for i := 1 to 5 do begin
    write(i: 2);
    Compute a lot with no output
end;
writeln

the output integers will not print until the writeln occurs. The delay can be somewhat discon¬ 
certing, and you should be aware that it will occur. By setting the b option to 0 before the 
program statement by inserting a comment of the form 

(*$b0*) 

we can cause output to be completely unbuffered, with a corresponding horrendous degradation 
in program efficiency. Option control in comments is discussed in section 5. 
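
A less drastic alternative is to call flush at the points where partial output should appear, as mentioned in the list above. The following is a minimal sketch under stated assumptions: the program name progress is hypothetical, the inner loop merely stands in for a long computation, and the call is assumed to take the file as its argument, as in flush(output).

```pascal
program progress(output);
{ Hypothetical example: show each number as soon as it is written. }
var
    i, j, sum: integer;
begin
    for i := 1 to 5 do begin
        write(i: 2);
        flush(output);           { force out the partial line now }
        sum := 0;
        for j := 1 to 1000 do    { stands in for a long computation }
            sum := sum + j mod 7
    end;
    writeln
end.
```

This keeps line buffering for the rest of the program and pays the cost of an extra system interaction only where the progress report is wanted.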
4.5. Files, reset, and rewrite

It is possible to use extended forms of the built-in procedures reset and rewrite to get more
general associations of UNIX file names with Pascal file variables. When a file other than 
input or output is to be read or written, then the reading or writing must be preceded by a 
reset or rewrite call. In general, if the Pascal file variable has never been used before, there will 
be no UNIX filename associated with it. As we saw in section 2.9, by mentioning the file in the 
program statement, we could cause a UNIX file with the same name as the Pascal variable to 
be associated with it. If we do not mention a file in the program statement and use it for the 
first time with the statement 

reset(f) 

or 


rewrite(f) 

then the Pascal system will generate a temporary name of the form ‘tmp.x’ for some character
‘x’, and associate this UNIX file name with the Pascal file. The first such generated
name will be ‘tmp.1’ and the names continue by incrementing their last character through the
ASCII set. The advantage of using such temporary files is that they are automatically removed 
by the Pascal system as soon as they become inaccessible. They are not removed, however, if a 
runtime error causes termination while they are in scope. 

To cause a particular UNIX pathname to be associated with a Pascal file variable we can
give that name in the reset or rewrite call, e.g. we could have associated the Pascal file data
with the file ‘primes’ in our example in section 3.1 by doing:

reset(data, 'primes')

instead of a simple

reset(data)

In this case it is not essential to mention ‘data’ in the program statement, but it is still a good
idea because it serves as an aid to program documentation. The second parameter to reset and
rewrite may be any string value, including a variable. Thus the names of UNIX files to be associated
with Pascal file variables can be read in at run time. Full details on file name/file variable
associations are given in section A.3.
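
As an illustration of reading a file name at run time, the following sketch prompts for a name, collects it into a blank padded string variable, and passes it to the two-argument form of reset. The program name showfile, the 100 character limit and the prompt text are assumptions made for this example only, and the two-argument reset is the Berkeley extension described above, so the program is not expected to be portable.

```pascal
program showfile(input, output);
{ Hypothetical example: copy a file whose name is typed at run time. }
var
    name: packed array [1..100] of char;
    f: text;
    ch: char;
    n: integer;
begin
    for n := 1 to 100 do
        name[n] := ' ';                  { blank pad the name }
    write('file name? ');
    n := 0;
    while not eoln and (n < 100) do begin
        read(ch);
        n := n + 1;
        name[n] := ch
    end;
    readln;
    reset(f, name);                      { Berkeley extension: second argument names the file }
    while not eof(f) do begin
        while not eoln(f) do begin
            read(f, ch);
            write(ch)
        end;
        readln(f);
        writeln
    end
end.
```

The prompt is written before eoln is first tested, so it appears before the system waits for the user to type the name.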

4.6. Argc and argv 

Each UNIX process receives a variable length sequence of arguments each of which is a 
variable length character string. The built-in function argc and the built-in procedure argv can 
be used to access and process these arguments. The value of the function argc is the number 
of arguments to the process. By convention, the arguments are treated as an array, and 
indexed from 0 to argc-1, with the zeroth argument being the name of the program being executed.
The rest of the arguments are those passed to the command on the command line.
Thus, the command 

% obj /etc/motd /usr/dict/words hello 

will invoke the program in the file obj with argc having a value of 4. The zeroth element 
accessed by argv will be ‘obj’, the first ‘/etc/motd’, etc.

Pascal does not provide variable size arrays, nor does it allow character strings of varying 
length. For this reason, argv is a procedure and has the syntax 

argv(i, a) 

where i is an integer and a is a string variable. This procedure call assigns the (possibly 
truncated or blank padded) i’th argument of the current process to the string variable a. The
file manipulation routines reset and rewrite will strip trailing blanks from their optional second 
arguments so that this blank padding is not a problem in the usual case where the arguments 
are file names. 
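
A minimal sketch of argc and argv in use follows. The program name args and the 40 character width of the string variable are assumptions made for illustration; arguments longer than 40 characters would be truncated and shorter ones blank padded, as described above. Both built-ins are Berkeley extensions.

```pascal
program args(output);
{ Hypothetical example: list the arguments of the current process. }
var
    i: integer;
    a: packed array [1..40] of char;
begin
    for i := 0 to argc - 1 do begin
        argv(i, a);              { copy the i'th argument into a }
        writeln(i: 2, ': ', a)
    end
end.
```

The zeroth line printed names the program itself, and the remaining lines show the arguments in the order given on the command line.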

We are now ready to give a Berkeley Pascal program ‘kat’, based on that given in section 
3.1 above, which can be used with the same syntax as the UNIX system program cat (1). 

% cat kat.p
program kat(input, output);
var
    ch: char;
    i: integer;
    name: packed array [1..100] of char;
begin
    i := 1;
    repeat
        if i < argc then begin
            argv(i, name);
            reset(input, name);
            i := i + 1
        end;
        while not eof do begin
            while not eoln do begin
                read(ch);
                write(ch)
            end;
            readln;
            writeln
        end
    until i >= argc
end { kat }.
%

Note that the reset call to the file input here, which is necessary for a clear program, may be 
disallowed on other systems. As this program deals mostly with argc and argv and UNIX system
dependent considerations, portability is of little concern.

If this program is in the file ‘kat.p’, then we can do

% pi kat.p 
% mv obj kat 
% kat primes 


2 

3 

5 

7 

11 

13 

17 

19 

23 

29 

31 

37 

41 

43 

47 

53 

59 

61 

67 

71 

73 

79 

83 

89 

97 

101 

103 

107 

109 

113 

127 

131 

137 

139 

149 

151 

157 

163 

167 

173 

179 

181 

191 

193 

197 

199 

211 

223 

227 

229 


930 statements executed in 0.18 seconds cpu time. 

% kat 

This is a line of text. 

This is a line of text. 

The next line contains only an end-of-file (an invisible control-d!)

The next line contains only an end-of-file (an invisible control-d!)

287 statements executed in 0.03 seconds cpu time.

% 

Thus we see that, if it is given arguments, ‘kat’ will, like cat, copy each one in turn. If no arguments
are given, it copies from the standard input. Thus it will work as it did before, with

% kat < primes 

now equivalent to 

% kat primes 

although the mechanisms are quite different in the two cases. Note that if ‘kat’ is given a bad 
file name, for example: 

% kat xxxxqqq 


Could not open xxxxqqq: No such file or directory 
Error in "kat"+5 near line 11.

4 statements executed in 0.02 seconds cpu time. 

% 

it will give a diagnostic and a post-mortem control flow backtrace for debugging. If we were 
going to use ‘kat’, we might want to translate it differently, e.g.: 

% pi -pb kat.p
% mv obj kat 

Here we have disabled the post-mortem statistics printing, so as not to get the statistics or the 
full traceback on error. The b option will cause the system to block buffer the input/output so 
that the program will run more efficiently on large files. We could have also specified the t 
option to turn off runtime tests if that was felt to be a speed hindrance to the program. Thus 
we can try the last examples again: 

% kat xxxxqqq 


Could not open xxxxqqq: No such file or directory 

Error in "kat"

% kat primes 


2 

3 

5 

7 

11 

13 

17 

19 

23 

29 

31 

37 

41 

43 

47 

53 

59 

61 

67 

71 

73 

79 

83 

89 

97 

101 

103 

107 

109 

113 

127 

131 

137 

139 

149 

151 

157 

163 

167 

173 

179 

181 

191 

193 

197 

199 

211 

223 

227 

229 


% 

The interested reader may wish to try writing a program which accepts command line 
arguments like pi does, using argc and argv to process them. 
5. Details on the components of the system

5.1. Options 

The programs pi, pc, and pxp take a number of options.† There is a standard UNIX convention
for passing options to programs on the command line, and this convention is followed
by the Berkeley Pascal system programs. As we saw in the examples above, option related
arguments consisted of the character ‘-’ followed by a single character option name.

Except for the b option which takes a single digit value, each option may be set on
(enabled) or off (disabled). When an on/off valued option appears on the command line of pi
or pix it inverts the default setting of that option. Thus

% pi -l foo.p

enables the listing option l, since it defaults off, while

% pi -t foo.p

disables the run time tests option t, since it defaults on.

In addition to inverting the default settings of pi options on the command line, it is also
possible to control the pi options within the body of the program by using comments of a special
form illustrated by

{$l-}

Here we see that the opening comment delimiter (which could also be a ‘(*’) is immediately
followed by the character ‘$’. After this ‘$’, which signals the start of the option list, we
can place a sequence of letters and option controls, separated by ‘,’ characters‡. The most
basic actions for options are to set them, thus

{$l+ Enable listing}

or to clear them

{$t-,p- No run-time tests, no post mortem analysis}

Notice that ‘+’ always enables an option and ‘-’ always disables it, no matter what the default
is. Thus ‘-’ has a different meaning in an option comment than it has on the command line.
As shown in the examples, normal comment text may follow the option list.

†As pix uses pi to translate Pascal programs, it takes the options of pi also. We refer to them here, however,
as pi options.

‡This format was chosen because it is used by Pascal 6000-3.4. In general the options common to both
implementations are controlled in the same way so that comment control in options is mostly portable. It is
recommended, however, that only one control be put per comment for maximum portability, as the Pascal
6000-3.4 implementation will ignore controls after the first one which it does not recognize.

5.2. Options common to Pi, Pc, and Pix

The following options are common to both the compiler and the interpreter. With each
option we give its default setting, the setting it would have if it appeared on the command line,
and a sample command using the option. Most options are on/off valued, with the b option
taking a single digit value.

Buffering of the file output - b

The b option controls the buffering of the file output. The default is line buffering, with
flushing at each reference to the file input and under certain other circumstances detailed in
section 4.4 above. Mentioning b on the command line, e.g.

% pi -b assembler.p

causes standard output to be block buffered, where a block is some system-defined number of 
characters. The b option may also be controlled in comments. It, unique among the Berkeley 
Pascal options, takes a single digit value rather than an on or off setting. A value of 0, e.g. 

{$b0}

causes the file output to be unbuffered. Any value 2 or greater causes block buffering and is 
equivalent to the flag on the command line. The option control comment setting b must precede
the program statement.

Include file listing - i 

The i option takes the name of an include file, procedure or function name and causes
it to be listed while translating†. Typical uses would be

% pix -i scanner.i compiler.p

to make a listing of the routines in the file scanner.i, and

% pix -i scanner compiler.p

to make a listing of only the routine scanner. This option is especially useful for 
conservation-minded programmers making partial program listings. 

Make a listing - l

The l option enables a listing of the program. The l option defaults off. When specified
on the command line, it causes a header line identifying the version of the translator in use
and a line giving the modification time of the file being translated to appear before the actual
program listing. The l option is pushed and popped by the i option at appropriate points in
the program.

Standard Pascal only - s

The s option causes many of the features of the UNIX implementation which are not 
found in standard Pascal to be diagnosed as ‘s’ warning errors. This option defaults off and is 
enabled when mentioned on the command line. Some of the features which are diagnosed are: 
non-standard procedures and functions, extensions to the procedure write, and the padding 
of constant strings with blanks. In addition, all letters are mapped to lower case except in 
strings and characters so that the case of keywords and identifiers is effectively ignored. The s 
option is most useful when a program is to be transported, thus 

% pi -s isitstd.p

will produce warnings unless the program meets the standard. 

Runtime tests - t and C

These options control the generation of tests that subrange variable values are within
bounds at run time. Pi defaults to generating tests and uses the option t to disable them. Pc
defaults to not generating tests, and uses the option C to enable them. Disabling runtime tests
also causes assert statements to be treated as comments.‡


†Include files are discussed in section 5.9.

‡See section A.1 for a description of assert statements.
Suppress warning diagnostics - w

The w option, which defaults on, allows the translator to print a number of warnings
about inconsistencies it finds in the input program. Turning this option off with a comment of
the form

{$w-}

or on the command line

% pi -w tryme.p

suppresses these usually useful diagnostics. 

Generate counters for a pxp execution profile - z 

The z option, which defaults off, enables the production of execution profiles. Specifying
z on the command line, i.e.

% pi -z foo.p

or enabling it in a comment before the program statement causes pi and pc to insert
operations in the interpreter code to count the number of times each statement was executed.
An example of using pxp was given in section 2.6; its options are described in section 5.6. Note 
that the z option cannot be used on separately compiled programs. 

5.3. Options available in Pi 

Post-mortem dump - p 

The p option defaults on, and causes the runtime system to initiate a post-mortem backtrace
when an error occurs. It also causes px to count statements in the executing program,
enforcing a statement limit to prevent infinite loops. Specifying p on the command line disables
these checks and the ability to give this post-mortem analysis. It does make smaller and
faster programs, however. It is also possible to control the p option in comments. To prevent
the post-mortem backtrace on error, p must be off at the end of the program statement.
Thus, the Pascal cross-reference program was translated with

% pi -pbt pxref.p


5.4. Options available in Px 

The first argument to px is the name of the file containing the program to be interpreted. 
If no arguments are given, then the file obj is executed. If more arguments are given, they are 
available to the Pascal program by using the built-ins argc and argv as described in section 4.6. 

Px may also be invoked automatically. In this case, whenever a Pascal object file name is 
given as a command, the command will be executed with px prepended to it; that is 

% obj primes 

will be converted to read 

% px obj primes 

5.5. Options available in Pc 
Generate assembly language - S

The program is compiled and the assembly language output is left in a file whose name ends in .s.
Thus

% pc -S foo.p

creates a file foo.s. No executable file is created.


Symbolic Debugger Information - g 

The g option causes the compiler to generate information needed by sdb(1), the symbolic
debugger. For a complete description of sdb see Volume 2c of the UNIX Reference Manual.

Redirect the output file - o 

The name argument after the -o is used as the name of the output file instead of a.out.
Its typical use is to name the compiled program using the root of the file name. Thus:

% pc -o myprog myprog.p

causes the compiled program to be called myprog. 

Generate counters for a prof execution profile - p

The compiler produces code which counts the number of times each routine is called.
The profiling is based on a periodic sample taken by the system rather than by inline counters
used by pxp. This results in less degradation in execution, at somewhat of a loss in accuracy.
See prof(1) for a more complete description.

Run the object code optimizer - O 

The output of the compiler is run through the object code optimizer. This provides an 
increase in compile time in exchange for a decrease in compiled code size and execution time. 

5.6. Options available in Pxp 

Pxp takes, on its command line, a list of options followed by the program file name, 
which must end in ‘.p’ as it must for pi, pc, and pix. Pxp will produce an execution profile if 
any of the z, t or c options is specified on the command line. If none of these options is 
specified, then pxp functions as a program reformatter. 

It is important to note that only the z and w options of pxp, which are common to pi, pc, 
and pxp can be controlled in comments. All other options must be specified on the command 
line to have any effect. 

The following options are relevant to profiling with pxp: 

Include the bodies of all routines in the profile - a 

Pxp normally suppresses printing the bodies of routines which were never executed, to 
make the profile more compact. This option forces all routine bodies to be printed. 

Suppress declaration parts from a profile - d

Normally a profile includes declaration parts. Specifying d on the command line 
suppresses declaration parts. 

Eliminate include directives - e 

Normally, pxp preserves include directives to the output when reformatting a program, 
as though they were comments. Specifying -e causes the contents of the specified files to be
reformatted into the output stream instead. This is an easy way to eliminate include
directives, e.g. before transporting a program.

Fully parenthesize expressions - f 

Normally pxp prints expressions with the minimal parenthesization necessary to preserve 
the structure of the input. This option causes pxp to fully parenthesize expressions. Thus the 
statement which prints as 

d := a + b mod c / e

with minimal parenthesization, the default, will print as 

d := a + ((b mod c) / e) 
with the f option specified on the command line. 
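
The full parenthesization simply makes explicit the grouping that Pascal's operator precedence already imposes. The following sketch, with a hypothetical program name and arbitrarily chosen values, evaluates both forms so that they can be compared; with the values shown both expressions come to 2.5.

```pascal
program prec(output);
{ Hypothetical example: both forms of the expression agree. }
var
    a, b, c: integer;
    e, d1, d2: real;
begin
    a := 1;
    b := 7;
    c := 4;
    e := 2.0;
    d1 := a + b mod c / e;           { minimal parenthesization }
    d2 := a + ((b mod c) / e);       { as printed with the f option }
    writeln(d1: 6: 2, d2: 6: 2)
end.
```

Here mod and / have the same precedence and group from left to right, and both bind more tightly than +, which is why the fully parenthesized form nests as it does.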

Left justify all procedures and functions - j 

Normally, each procedure and function body is indented to reflect its static nesting 
depth. This option prevents this nesting and can be used if the indented output would be too 
wide. 

Print a table summarizing procedure and function calls - t 

The t option causes pxp to print a table summarizing the number of calls to each procedure
and function in the program. It may be specified in combination with the z option,
or separately. 

Enable and control the profile - z 

The z profile option is very similar to the i listing control option of pi. If z is specified 
on the command line, then all arguments up to the source file argument which ends in ‘.p’ are 
taken to be the names of procedures and functions or include files which are to be profiled. 
If this list is null, then the whole file is to be profiled. A typical command for extracting a 
profile of part of a large program would be 

% pxp -z test parser.i compiler.p

This specifies that profiles of the routines in the file parser.i and the routine test are to be
made. 

5.7. Formatting programs using pxp 

The program pxp can be used to reformat programs, by using a command of the form 

% pxp dirty.p > clean.p 

Note that since the shell creates the output file ‘clean.p’ before pxp executes, ‘clean.p’ and
‘dirty.p’ must not be the same file.

Pxp automatically paragraphs the program, performing housekeeping chores such as comment
alignment, and treating blank lines, lines containing exactly one blank and lines containing
only a form-feed character as though they were comments, preserving their vertical spacing
effect in the output. Pxp distinguishes between four kinds of comments:

1) Left marginal comments, which begin in the first column of the input line and are 
placed in the first column of an output line. 

2) Aligned comments, which are preceded by no input tokens on the input line. These 
are aligned in the output with the running program text. 

3) Trailing comments, which are preceded in the input line by a token with no more 
than two spaces separating the token from the comment. 
4) Right marginal comments, which are preceded in the input line by a token from
which they are separated by at least three spaces or a tab. These are aligned down 
the right margin of the output, currently to the first tab stop after the 40th column 
from the current “left margin”. 

Consider the following program. 

% cat comments.p
{ This is a left marginal comment. }
program hello (output);
var i : integer; {This is a trailing comment}
j : integer;     {This is a right marginal comment}
k : array [1..10] of array [1..10] of integer;     {Marginal, but past the margin}
 {
   An aligned, multi-line comment
   which explains what this program is
   all about
 }
begin
i := 1; {Trailing i comment}
{A left marginal comment}
 {An aligned comment}
j := 1;          {Right marginal comment}
k[1] := 1;
writeln(i, j, k[1])
end.

When formatted by pxp the following output is produced. 

% pxp comments.p
{ This is a left marginal comment. }

program hello (output);
var
    i: integer; {This is a trailing comment}
    j: integer;                             {This is a right marginal comment}
    k: array [1..10] of array [1..10] of integer;   {Marginal, but past the margin}
    {
      An aligned, multi-line comment
      which explains what this program is
      all about
    }
begin
    i := 1; {Trailing i comment}
{A left marginal comment}
    {An aligned comment}
    j := 1;                                 {Right marginal comment}
    k[1] := 1;
    writeln(i, j, k[1])
end.
%

The following formatting related options are currently available in pxp. The options f and j 
described in the previous section may also be of interest. 
Strip comments - s

The s option causes pxp to remove all comments from the input text.

Underline keywords - _

A command line argument of the form -_ as in

% pxp -_ dirty.p

can be used to cause pxp to underline all keywords in the output for enhanced readability.

Specify indenting unit - [23456789]

The normal unit which pxp uses to indent a structure statement level is 4 spaces. By giving
an argument of the form -d with d a digit, 2 ≤ d ≤ 9, you can specify that d spaces are to
be used per level instead.

5.8. Pxref 

The cross-reference program pxref may be used to make cross-referenced listings of Pascal
programs. To produce a cross-reference of the program in the file ‘foo.p’ one can execute
the command: 

% pxref foo.p 

The cross-reference is, unfortunately, not block structured. Full details on pxref are given in 
its manual section pxref(1).

5.9. Multi-file programs 

A text inclusion facility is available with Berkeley Pascal. This facility allows the interpolation
of source text from other files into the source stream of the translator. It can be used
to divide large programs into more manageable pieces for ease in editing, listing, and maintenance.

The include facility is based on that of the UNIX C compiler. To trigger it you can place
the character ‘#’ in the first portion of a line and then, after an arbitrary number of blanks or
tabs, the word ‘include’ followed by a filename enclosed in single or double quotation
marks. The file name may be followed by a semicolon ‘;’ if you wish to treat this as a pseudo-
Pascal statement. The filenames of included files must end in ‘.i’. An example of the use of
included files in a main program would be:

program compiler(input, output, obj);

#include "globals.i"
#include "scanner.i"
#include "parser.i"
#include "semantics.i"

begin
    { main program }
end.

At the point the include pseudo-statement is encountered in the input, the lines from 
the included file are interpolated into the input stream. For the purposes of translation and 
runtime diagnostics and statement numbers in the listings and post-mortem backtraces, the 
lines in the included file are numbered from 1. Nested includes are possible up to 10 deep. 

See the descriptions of the i option of pi in section 5.2 above; this can be used to control 
listing when include files are present. 
When a non-trivial line is encountered in the source text after an include finishes, the
‘popped’ filename is printed, in the same manner as above. 

For the purposes of error diagnostics when not making a listing, the filename will be 
printed before each diagnostic if the current filename has changed since the last filename was 
printed. 

5.10. Separate Compilation with Pc 

A separate compilation facility is provided with the Berkeley Pascal compiler, pc. This 
facility allows programs to be divided into a number of files and the pieces to be compiled individually,
to be linked together at some later time. This is especially useful for large programs,
where small changes would otherwise require time-consuming re-compilation of the entire program.

Normally, pc expects to be given entire Pascal programs. However, if given the -c option
on the command line, it will accept a sequence of definitions and declarations, and compile
them into a .o file, to be linked with a Pascal program at a later time. In order that procedures
and functions be available across separately compiled files, they must be declared with
the directive external. This directive is similar to the directive forward in that it must pre¬ 
cede the resolution of the function or procedure, and formal parameters and function result 
types must be specified at the external declaration and may not be specified at the resolution. 

Type checking is performed across separately compiled files. Since Pascal type definitions
define unique types, any types which are shared between separately compiled files must be the
same definition. This seemingly impossible problem is solved using a facility similar to the
include facility discussed above. Definitions may be placed in files with the extension .h and
the files included by separately compiled files. Each definition from a .h file defines a unique
type, and all uses of a definition from the same .h file define the same type. Similarly, the
facility is extended to allow the definition of consts and the declaration of labels, vars, and
external functions and procedures. Thus procedures and functions which are used
between separately compiled files must be declared external, and must be so declared in a .h
file included by any file which calls or resolves the function or procedure. Conversely,
functions and procedures declared external may only be so declared in .h files. These files may
be included only at the outermost level, and thus define or declare global objects. Note that
since only external function and procedure declarations (and not resolutions) are allowed in
.h files, statically nested functions and procedures cannot be declared external.

An example of the use of included .h files in a program would be: 

program compiler (input, output, obj);
#include "globals.h"
#include "scanner.h"
#include "parser.h"
#include "semantics.h"
begin
{ main program }
end.

This might include in the main program the definitions and declarations of all the global
labels, consts, types and vars from the file globals.h, and the external function and procedure
declarations for each of the separately compiled files for the scanner, parser and semantics.
The header file scanner.h would contain declarations of the form:

type
token = record
{ token fields }
end;

function scan(var inputfile: text): token;
external;

Then the scanner might be in a separately compiled file containing: 

#include "globals.h"
#include "scanner.h"
function scan;
begin
{ scanner code }
end;

which includes the same global definitions and declarations and resolves the scanner functions 
and procedures declared external in the file scanner.h. 






A. Appendix to Wirth’s Pascal Report 

This section is an appendix to the definition of the Pascal language in Niklaus Wirth’s
Pascal Report and, with that Report, precisely defines the Berkeley implementation. This
appendix includes a summary of extensions to the language, gives the ways in which the
undefined specifications were resolved, gives limitations and restrictions of the current
implementation, and lists the added functions and procedures available. It concludes with a list of
differences with the commonly available Pascal 6000-3.4 implementation, and some comments
on standard and portable Pascal.

A.1. Extensions to the language Pascal

This section defines non-standard language constructs available in Berkeley Pascal. The 
s standard Pascal option of the translators pi and pc can be used to detect these extensions in 
programs which are to be transported. 

String padding 

Berkeley Pascal will pad constant strings with blanks in expressions and as value parameters
to make them as long as is required. The following is a legal Berkeley Pascal program:

program x(output);
var z : packed array [ 1 .. 13 ] of char;
begin
z := 'red';
writeln(z)
end.

The padded blanks are added on the right. Thus the assignment above is equivalent to:

z := 'red          '

which is standard Pascal.
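
The padding rule can be modelled outside Pascal. The following Python sketch is only an illustration of the rule stated above, not part of the translator; the function name pad_constant is hypothetical, and raising an error for a constant longer than the target is an assumption, since the text does not say how that case is treated.

```python
def pad_constant(text: str, width: int) -> str:
    """Blank-pad a constant string on the right to exactly `width` characters."""
    if len(text) > width:
        # Assumption: an over-long constant is rejected; the manual does not say.
        raise ValueError("constant string longer than the target array")
    return text.ljust(width)


# Mirrors the assignment to z, a packed array [1..13] of char.
z = pad_constant("red", 13)
print(repr(z))
```

The result has length 13, with the ten added blanks on the right.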

Octal constants, octal and hexadecimal write 

Octal constants may be given as a sequence of octal digits followed by the character ‘b’ or
‘B’. The forms

write(a:n oct)

and

write(a:n hex)

cause the internal representation of expression a, which must be Boolean, character, integer,
pointer, or a user-defined enumerated type, to be written in octal or hexadecimal respectively.
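
As an illustration of the notation, the Python sketch below converts an octal constant written with the ‘b’ or ‘B’ suffix to its value and renders a non-negative value in octal or hexadecimal digits. The function names are hypothetical, and the sketch makes no claim about the padding or letter case that the actual write procedure produces.

```python
def parse_octal_constant(token: str) -> int:
    """Convert an octal constant such as '17b' or '777B' to an integer."""
    if len(token) < 2 or token[-1] not in "bB":
        raise ValueError(f"not an octal constant: {token!r}")
    # int() rejects the digits 8 and 9 when asked for base 8.
    return int(token[:-1], 8)


def internal_digits(value: int, base: str) -> str:
    """Render a non-negative value as octal ('oct') or hexadecimal ('hex') digits."""
    if value < 0:
        raise ValueError("this sketch handles only non-negative values")
    return format(value, {"oct": "o", "hex": "x"}[base])


print(parse_octal_constant("17b"), internal_digits(255, "hex"))
```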

Assert statement 

An assert statement causes a Boolean expression to be evaluated each time the statement
is executed. A runtime error results if any of the expressions evaluates to be false. The
assert statement is treated as a comment if run-time tests are disabled. The syntax for
assert is:

assert <expr> 




















Enumerated type input-output 

Enumerated types may be read and written. On output the string name associated with 
the enumerated value is output. If the value is out of range, a runtime error occurs. On input 
an identifier is read and looked up in a table of names associated with the type of the variable, 
and the appropriate internal value is assigned to the variable being read. If the name is not 
found in the table a runtime error occurs. 

Structure returning functions 

An extension has been added which allows functions to return arbitrary sized structures 
rather than just scalars as in the standard. 

Separate compilation 

The compiler pc has been extended to allow separate compilation of programs. Procedures
and functions declared at the global level may be compiled separately. Type checking
of calls to separately compiled routines is performed at load time to ensure that the program as
a whole is consistent. See section 5.10 for details.

A.2. Resolution of the undefined specifications 

File name - file variable associations 

Each Pascal file variable is associated with a named UNIX file. Except for input and output,
which are exceptions to some of the rules, a name can become associated with a file in any
of three ways:

1) If a global Pascal file variable appears in the program statement then it is
associated with the UNIX file of the same name.

2) If a file was reset or rewritten using the extended two-argument form of reset or
rewrite then the given name is associated.

3) If a file which has never had a UNIX name associated is reset or rewritten without
specifying a name via the second argument, then a temporary name of the form
‘tmp.x’ is associated with the file. Temporary names start with ‘tmp.1’ and continue
by incrementing the last character in the USASCII ordering. Temporary files
are removed automatically when their scope is exited.
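
The naming sequence in rule 3 can be sketched as follows. This Python generator only illustrates "incrementing the last character in the USASCII ordering"; the name temp_names and the stopping point at the last printable character are assumptions, since the text does not say what happens when the characters run out.

```python
from itertools import islice


def temp_names(first: str = "1"):
    """Yield 'tmp.1', 'tmp.2', ... by stepping the last character through ASCII."""
    code = ord(first)
    while code < 127:  # assumption: stop after '~', the last printable character
        yield "tmp." + chr(code)
        code += 1


first_names = list(islice(temp_names(), 10))
print(first_names)
```

Note that the name following ‘tmp.9’ in this ordering is ‘tmp.:’, because ‘:’ follows ‘9’ in ASCII.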

The program statement 

The syntax of the program statement is: 

program <id> ( <file id> { , <file id> } ) ;

The file identifiers (other than input and output) must be declared as variables of file type in 
the global declaration part. 

The files input and output 

The formal parameters input and output are associated with the UNIX standard input 
and output and have a somewhat special status. The following rules must be noted: 

1) The program heading must contain the formal parameter output. If input is used,
explicitly or implicitly, then it must also be declared here.

2) Unlike all other files, the Pascal files input and output must not be defined in a
declaration, as their declaration is automatically:

var input, output: text

3) The procedure reset may be used on input. If no UNIX file name has ever been
associated with input, and no file name is given, then an attempt will be made to
‘rewind’ input. If this fails, a run time error will occur. Rewrite calls to output act
as for any other file, except that output initially has no associated file. This means
that a simple

rewrite(output)

associates a temporary name with output.

Details for files 

If a file other than input is to be read, then reading must be initiated by a call to the
procedure reset which causes the Pascal system to attempt to open the associated UNIX file for
reading. If this fails, then a runtime error occurs. Writing of a file other than output must be
initiated by a rewrite call, which causes the Pascal system to create the associated UNIX file
and to then open the file for writing only.

Buffering 

The buffering for output is determined by the value of the b option at the end of the 
program statement. If it has its default value 1, then output is buffered in blocks of up to 512 
characters, flushed whenever a writeln occurs and at each reference to the file input. If it has 
the value 0, output is unbuffered. Any value of 2 or more gives block buffering without line or 
input reference flushing. All other output files are always buffered in blocks of 512 characters. 
All output buffers are flushed when the files are closed at scope exit, whenever the procedure 
message is called, and can be flushed using the built-in procedure flush. 

An important point for an interactive implementation is the definition of ‘input↑’. If
input is a teletype, and the Pascal system reads a character at the beginning of execution to
define ‘input↑’, then no prompt could be printed by the program before the user is required to
type some input. For this reason, ‘input↑’ is not defined by the system until its definition is
needed, reading from a file occurring only when necessary.

The character set 

Seven bit USASCII is the character set used on UNIX. The standard Pascal symbols ‘and’,
‘or’, ‘not’, ‘<=’, ‘>=’, ‘<>’, and the uparrow ‘↑’ (for pointer qualification) are recognized.† Less
portable are the synonyms tilde ‘~’ for not, ‘&’ for and, and ‘|’ for or.

Upper and lower case are considered to be distinct. Keywords and built-in procedure 
and function names are composed of all lower case letters. Thus the identifiers GOTO and 
GOto are distinct both from each other and from the keyword goto. The standard type 
‘boolean’ is also available as ‘Boolean’. 

Character strings and constants may be delimited by the character ‘'’ or by the character
‘#’; the latter is sometimes convenient when programs are to be transported. Note that the ‘#’
character has special meaning when it is the first character on a line - see Multi-file programs
below.

The standard types 

The standard type integer is conceptually defined as 
type integer = minint .. maxint; 


†On many terminals and printers, the up arrow is represented as a circumflex ‘^’. These are not
distinct characters, but rather different graphic representations of the same internal codes.
The proposed standard for Pascal considers them to be the same.

Integer is implemented with 32 bit twos complement arithmetic. Predefined constants of type
integer are:

const maxint = 2147483647; minint = -2147483648;

The standard type char is conceptually defined as 
type char = minchar .. maxchar; 

Built-in character constants are ‘minchar’ and ‘maxchar’, ‘bell’ and ‘tab’; ord(minchar) = 0, 
ord(maxchar) = 127. 
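
The predefined limits follow directly from the word sizes given above. As a small worked check, written in Python purely for the arithmetic (the variable names are illustrative only):

```python
INTEGER_BITS = 32  # integer is 32 bit twos complement
CHAR_BITS = 7      # seven bit USASCII

# A twos complement word of n bits covers -2**(n-1) .. 2**(n-1) - 1.
maxint = 2 ** (INTEGER_BITS - 1) - 1
minint = -(2 ** (INTEGER_BITS - 1))

# A seven bit character code covers 0 .. 2**7 - 1.
maxchar_ord = 2 ** CHAR_BITS - 1

print(minint, maxint, maxchar_ord)
```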

The type real is implemented using 64 bit floating point arithmetic. The floating point 
arithmetic is done in ‘rounded’ mode, and provides approximately 17 digits of precision with 
numbers as small as 10 to the negative 38th power and as large as 10 to the 38th power. 

Comments 

Comments can be delimited by either ‘{’ and ‘}’ or by ‘(*’ and ‘*)’. If the character ‘{’
appears in a comment delimited by ‘{’ and ‘}’, a warning diagnostic is printed. A similar
warning will be printed if the sequence ‘(*’ appears in a comment delimited by ‘(*’ and ‘*)’. The
restriction implied by this warning is not part of standard Pascal, but detects many otherwise
subtle errors.

Option control 

Options of the translators may be controlled in two distinct ways. A number of options 
may appear on the command line invoking the translator. These options are given as one or 
more strings of letters preceded by the character ‘-’ and cause the default setting of each given 
option to be changed. This method of communication of options is expected to predominate 
for UNIX. Thus the command 

% pi -l -s foo.p

translates the file foo.p with the listing option enabled (as it normally is off), and with only 
standard Pascal features available. 

If more control over the portions of the program where options are enabled is required, 
then option control in comments can and should be used. The format for option control in 
comments is identical to that used in Pascal 6000-3.4. One places the character ‘$’ as the first 
character of the comment and follows it by a comma separated list of directives. Thus an 
equivalent to the command line example given above would be: 

{$l+,s+ listing on, standard Pascal}

as the first line of the program. The ‘l’ option is more appropriately specified on the command
line, since it is extremely unlikely in an interactive environment that one wants a listing of the
program each time it is translated.

Directives consist of a letter designating the option, followed either by a ‘+’ to turn the
option on, or by a ‘-’ to turn the option off. The b option takes a single digit instead of a ‘+’
or ‘-’.
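
The directive format lends itself to a short parsing sketch. The Python below is a hypothetical reader for the ‘{$...}’ form only, written from the description above; it assumes that the directive list ends at the first blank or closing brace and that option letters are lower case. It is not the translators' actual scanner.

```python
import re

# One directive: an option letter followed by '+', '-', or a single digit.
DIRECTIVE = re.compile(r"([a-z])([+-]|\d)$")


def parse_option_comment(comment: str) -> dict:
    """Return {letter: True/False/int} for a comment that begins with '{$'."""
    head = re.match(r"\{\$([^\s}]*)", comment)
    if head is None:
        return {}  # an ordinary comment carries no options
    options = {}
    for item in head.group(1).split(","):
        match = DIRECTIVE.match(item)
        if match is None:
            raise ValueError(f"malformed directive: {item!r}")
        letter, setting = match.groups()
        options[letter] = int(setting) if setting.isdigit() else setting == "+"
    return options


print(parse_option_comment("{$l+,s+ listing on, standard Pascal}"))
```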

Notes on the listings 

The first page of a listing includes a banner line indicating the version and date of
generation of pi or pc. It also includes the UNIX path name supplied for the source file and the
date of last modification of that file.

Within the body of the listing, lines are numbered consecutively and correspond to the
line numbers for the editor. Currently, two special kinds of lines may be used to format the
listing: a line consisting of a form-feed character, control-l, which causes a page eject in the
listing, and a line with no characters which causes the line number to be suppressed in the
listing, creating a truly blank line. These lines thus correspond to ‘eject’ and ‘space’ macros found
in many assemblers. Non-printing characters are printed as the character ‘?’ in the listing.†

The standard procedure write 

If no minimum field length parameter is specified for a write, the following default values
are assumed:

integer         10
real            22
Boolean         length of ‘true’ or ‘false’
char            1
string          length of the string
oct             11
hex             8
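
The oct and hex defaults are consistent with the number of digits needed to show a 32 bit quantity, as the short Python check below illustrates. The helper write_field is hypothetical, and it assumes the usual Pascal convention that a value is right-justified in its field; the table above states only the widths.

```python
import math

INTERNAL_BITS = 32

# An octal digit carries 3 bits and a hexadecimal digit carries 4 bits.
oct_digits = math.ceil(INTERNAL_BITS / 3)
hex_digits = math.ceil(INTERNAL_BITS / 4)

DEFAULT_WIDTH = {"integer": 10, "real": 22, "char": 1,
                 "oct": oct_digits, "hex": hex_digits}


def write_field(text: str, kind: str) -> str:
    """Right-justify text in the default field for kind (assumed convention)."""
    return text.rjust(DEFAULT_WIDTH[kind])


print(repr(write_field("42", "integer")))
```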


The end of each line in a text file should be explicitly indicated by ‘writeln(f)’, where
‘writeln(output)’ may be written simply as ‘writeln’. For UNIX, the built-in function ‘page(f)’
puts a single ASCII form-feed character on the output file. For programs which are to be
transported the filter pcc can be used to interpret carriage control, as UNIX does not normally do
so.


A.3. Restrictions and limitations

Files

Files cannot be members of files or members of dynamically allocated structures. 

Arrays, sets and strings 

The calculations involving array subscripts and set elements are done with 16 bit
arithmetic. This restricts the types over which arrays and sets may be defined. The lower bound of
such a range must be greater than or equal to -32768, and the upper bound less than 32768. In
particular, strings may have any length from 1 to 65535 characters, and sets may contain no
more than 65535 elements.

Line and symbol length 

There is no intrinsic limit on the length of identifiers. Identifiers are considered to be
distinct if they differ in any single position over their entire length. There is a limit, however,
on the maximum input line length. This limit is quite generous, currently exceeding
160 characters.

Procedure and function nesting and program size 

At most 20 levels of procedure and function nesting are allowed. There is no
fundamental, translator defined limit on the size of the program which can be translated. The
ultimate limit is supplied by the hardware and thus, on the PDP-11, by the 16 bit address space. If
one runs up against the ‘ran out of memory’ diagnostic the program may yet translate if
smaller procedures are used, as a lot of space is freed by the translator at the completion of
each procedure or function in the current implementation.

On the VAX-11, there is an implementation defined limit of 65536 bytes per variable. 
There is no limit on the number of variables. 

†The character generated by a control-i indents to the next ‘tab stop’. Tab stops are set every 8 columns in
UNIX. Tabs thus provide a quick way of indenting in the program.

Overflow

There is currently no checking for overflow on arithmetic operations at run-time on the
PDP-11. Overflow checking is performed on the VAX-11 by the hardware.

A.4. Added types, operators, procedures and functions 

Additional predefined types 

The type alfa is predefined as:

type alfa = packed array [ 1..10 ] of char

The type intset is predefined as:

type intset = set of 0..127

In most cases the context of an expression involving a constant set allows the translator to
determine the type of the set, even though the constant set itself may not uniquely determine
this type. In the cases where it is not possible to determine the type of the set from local
context, the expression type defaults to a set over the entire base type unless the base type is
integer†. In the latter case the type defaults to the current binding of intset, which must be
“type set of (a subrange of) integer” at that point.

Note that if intset is redefined via:

type intset = set of 0..58;

then the default integer set is the implicit intset of Pascal 6000-3.4.

Additional predefined operators 

The relational ‘<’ and ‘>’ of proper set inclusion are available. With a and b sets, note
that

(not (a < b)) <> (a >= b)

As an example consider the sets a = [0,2] and b = [1]. The only relation true between these
sets is ‘<>’.
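
The example can be checked mechanically. Python's built-in sets happen to use the same operator meanings (‘<’ and ‘>’ for proper inclusion, ‘<=’ and ‘>=’ for inclusion), so the sketch below evaluates every relation for the two sets; Python's ‘!=’ stands in for Pascal's ‘<>’.

```python
a = {0, 2}
b = {1}

relations = {
    "=": a == b,
    "<>": a != b,
    "<": a < b,    # proper subset
    "<=": a <= b,  # subset
    ">": a > b,    # proper superset
    ">=": a >= b,  # superset
}
true_relations = [name for name, holds in relations.items() if holds]
print(true_relations)

# Neither set includes the other, so the negation of '<' is not the same as '>='.
print(not (a < b), a >= b)
```

Because the two sets are incomparable, both ‘a < b’ and ‘a >= b’ are false, which is why negating one does not yield the other.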

Non-standard procedures 

argv(i,a)       where i is an integer and a is a string variable assigns the (possibly
                truncated or blank padded) i’th argument of the invocation of the
                current UNIX process to the variable a. The range of valid i is 0 to
                argc-1.

date(a)         assigns the current date to the alfa variable a in the format ‘dd
                mmm yy ’, where ‘mmm’ is the first three characters of the month,
                e.g. ‘Apr’.

flush(f)        writes the output buffered for Pascal file f into the associated UNIX
                file.

halt            terminates the execution of the program with a control flow
                backtrace.

linelimit(f,x)‡ with f a textfile and x an integer expression causes the program to
                be abnormally terminated if more than x lines are written on file f.
                If x is less than 0 then no limit is imposed.

message(x,...)  causes the parameters, which have the format of those to the
                built-in procedure write, to be written unbuffered on the
                diagnostic unit 2, almost always the user’s terminal.

null            a procedure of no arguments which does absolutely nothing. It is
                useful as a place holder, and is generated by pxp in place of the
                invisible empty statement.

remove(a)       where a is a string causes the UNIX file whose name is a, with
                trailing blanks eliminated, to be removed.

reset(f,a)      where a is a string causes the file whose name is a (with blanks
                trimmed) to be associated with f in addition to the normal function
                of reset.

rewrite(f,a)    is analogous to ‘reset’ above.

stlimit(i)      where i is an integer sets the statement limit to be i statements.
                Specifying the p option to pc disables statement limit counting.

time(a)         causes the current time in the form ‘ hh:mm:ss ’ to be assigned to
                the alfa variable a.

†The current translator makes a special case of the construct ‘if ... in [ ... ]’ and enforces only the more lax
restriction on 16 bit arithmetic given above in this case.

‡Currently ignored by pdp-11 px.


Non-standard functions 


argc            returns the count of arguments when the Pascal program was
                invoked. Argc is always at least 1.

card(x)         returns the cardinality of the set x, i.e. the number of elements
                contained in the set.

clock           returns an integer which is the number of central processor
                milliseconds of user time used by this process.

expo(x)         yields the integer valued exponent of the floating-point
                representation of x; expo(x) = entier(log2(abs(x))).

random(x)       where x is a real parameter, evaluated but otherwise ignored,
                invokes a linear congruential random number generator. Successive
                seeds are generated as (seed*a + c) mod m and the new random
                number is a normalization of the seed to the range 0.0 to 1.0; a is
                62605, c is 113218009, and m is 536870912. The initial seed is
                7774755.

seed(i)         where i is an integer sets the random number generator seed to i
                and returns the previous seed. Thus seed(seed(i)) has no effect
                except to yield value i.

sysclock        an integer function of no arguments returns the number of central
                processor milliseconds of system time used by this process.

undefined(x)    a Boolean function. Its argument is a real number and it always
                returns false.

wallclock       an integer function of no arguments returns the time in seconds
                since 00:00:00 GMT January 1, 1970.
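
The recurrence given for random and the behaviour of seed can be sketched as below. The constants are those quoted in the text; the class name and the normalization by dividing the seed by m are assumptions, since the text says only that the result is normalized to the range 0.0 to 1.0.

```python
A = 62605
C = 113218009
M = 536870912          # equal to 2**29
INITIAL_SEED = 7774755


class Generator:
    """Linear congruential generator following the recurrence in the text."""

    def __init__(self, seed: int = INITIAL_SEED):
        self._seed = seed

    def random(self) -> float:
        # Successive seeds are generated as (seed*a + c) mod m.
        self._seed = (self._seed * A + C) % M
        return self._seed / M  # assumed normalization

    def seed(self, i: int) -> int:
        """Set the seed to i and return the previous seed."""
        previous, self._seed = self._seed, i
        return previous


g = Generator()
sample = [g.random() for _ in range(3)]
print(sample)
```

With this model, g.seed(g.seed(i)) first installs i and then immediately restores the earlier seed, returning i, which matches the remark that the nested call has no effect except to yield i.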


A.5. Remarks on standard and portable Pascal 

It is occasionally desirable to prepare Pascal programs which will be acceptable at other
Pascal installations. While certain system dependencies are bound to creep in, judicious design
and programming practice can usually eliminate most of the non-portable usages. Wirth’s
Pascal Report concludes with a standard for implementation and program exchange.

In particular, the following differences may cause trouble when attempting to transport
programs between this implementation and Pascal 6000-3.4. Using the s translator option
may serve to indicate many problem areas.†

Features not available in Berkeley Pascal 

Segmented files and associated functions and procedures. 

The function trunc with two arguments. 

Arrays whose indices exceed the capacity of 16 bit arithmetic. 

Features available in Berkeley Pascal but not in Pascal 6000-3.4 

The procedures reset and rewrite with file names. 

The functions argc, seed, sysclock, and wallclock.

The procedures argv, flush, and remove.

Message with arguments other than character strings. 

Write with keyword hex. 

The assert statement. 

Reading and writing of enumerated types. 

Allowing functions to return structures. 

Separate compilation of programs. 

Comparison of records. 

Other problem areas 

Sets and strings are more general in Berkeley Pascal; see the restrictions given in the 
Jensen-Wirth User Manual for details on the 6000-3.4 restrictions. 

The character set differences may cause problems, especially the use of the function chr, 
characters as arguments to ord, and comparisons of characters, since the character set ordering 
differs between the two machines. 

The Pascal 6000-3.4 compiler uses a less strict notion of type equivalence. In Berkeley 
Pascal, types are considered identical only if they are represented by the same type identifier. 
Thus, in particular, unnamed types are unique to the variables/fields declared with them. 

Pascal 6000-3.4 doesn’t recognize our option flags, so it is wise to put the control of 
Berkeley Pascal options to the end of option lists or, better yet, restrict the option list length 
to one. 

For Pascal 6000-3.4 the ordering of files in the program statement has significance. It is 
desirable to place input and output as the first two files in the program statement. 


†The s option does not, however, check that identifiers differ in the first 8 characters. Pi and pc also do not
check the semantics of packed.

A Portable Fortran 77 Compiler

S. I. Feldman 

P. J. Weinberger 

Bell Laboratories 
Murray Hill, New Jersey 07974 


1. INTRODUCTION 

The Fortran language has been revised. The new language, known as Fortran 77, became an 
official American National Standard [1] on April 3, 1978. Fortran 77 supplants 1966 Standard 
Fortran [2]. We report here on a compiler and run-time system for the new extended language. 
The compiler and computation library were written by S.I.F., the I/O system by P.J.W. We 
believe ours to be the first complete Fortran 77 system to be implemented. This compiler is 
designed to be portable to a number of different machines, to be correct and complete, and to 
generate code compatible with calling sequences produced by compilers for the C language [3]. 
In particular, it is in use on UNIX systems. Two families of C compilers are in use at Bell 
Laboratories, those based on D. M. Ritchie’s PDP-11 compiler [4] and those based on S. C. 
Johnson’s portable C compiler [5]. This Fortran compiler can drive the second passes of either 
family. In this paper, we describe the language compiled, interfaces between procedures, and 
file formats assumed by the I/O system. We will describe implementation details in companion 
papers. 

1.1. Usage 

At present, versions of the compiler run on and compile for the PDP-11, the VAX-11/780, 
and the Interdata 8/32 UNIX systems. The command to run the compiler is 

f77 flags file ...

f77 is a general-purpose command for compiling and loading Fortran and Fortran-related
files. EFL [6] and Ratfor [7] source files will be preprocessed before being presented to
the Fortran compiler. C and assembler source files will be compiled by the appropriate
programs. Object files will be loaded. (The f77 and cc commands cause slightly different
loading sequences to be generated, since Fortran programs need a few extra libraries and a
different startup routine than do C programs.) The following file name suffixes are
understood:

.f Fortran source file 
.F Fortran source file 
.e EFL source file 
.r Ratfor source file 
.c C source file 
.s Assembler source file 
.o Object file 

Arguments whose names end with .f are taken to be Fortran 77 source programs; they are 
compiled, and each object program is left on the file in the current directory whose name 
is that of the source with .o substituted for .f. 

Arguments whose names end with .F are also taken to be Fortran 77 source programs;
these are first processed by the C preprocessor before being compiled by f77.

Arguments whose names end with .r or .e are taken to be Ratfor or EFL source programs,
respectively; these are first transformed by the appropriate preprocessor, then compiled by
f77.

In the same way, arguments whose names end with .c or .s are taken to be C or assembly
source programs and are compiled or assembled, producing a .o file.
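
The suffix rules above amount to a small dispatch table. The Python sketch below is only a model of that table, not the f77 driver; the stage names are descriptive labels rather than real program names, and the function plan is hypothetical.

```python
import os

# Processing stages applied to each recognized source suffix, in order.
PIPELINE = {
    ".f": ["f77 compiler"],
    ".F": ["C preprocessor", "f77 compiler"],
    ".r": ["Ratfor preprocessor", "f77 compiler"],
    ".e": ["EFL preprocessor", "f77 compiler"],
    ".c": ["C compiler"],
    ".s": ["assembler"],
}


def plan(filename: str):
    """Return (stages, object_file) for one command-line file argument."""
    stem, suffix = os.path.splitext(os.path.basename(filename))
    if suffix in PIPELINE:
        # The object file is left in the current directory with .o substituted.
        return PIPELINE[suffix], stem + ".o"
    # Object files and unrecognized names go to the loader unchanged.
    return [], filename


for name in ["lib/x.F", "y.r", "z.o"]:
    print(name, plan(name))
```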

The following flags are understood: 

-c              Compile but do not load. Output for x.f, x.F, x.e, x.r, x.c, or x.s is put
                on file x.o.

-g              Have the compiler produce additional symbol table information for
                dbx(1). This only applies on the Vax UNIX system. Do not use with -O.

-I2             On machines which support short integers, make the default integer
                constants and variables short (see section 2.14). (-I4 is the standard value
                of this option). All logical quantities will be short.

-m              Apply the M4 macro preprocessor to each EFL or Ratfor source file before
                using the appropriate compiler.

-o file         Put executable module on file file. (Default is a.out).

-onetrip        Compile code that performs every do loop at least once (see section 2.12).

-p              Generate code to produce usage profiles.

-pg             Generate code in the manner of -p, but invoke a run-time recording
                mechanism that keeps more extensive statistics.

-w              Suppress all warning messages.

-w66            Suppress warnings about Fortran 66 features used.

-u              Make the default type of a variable undefined (see section 2.3).

-C              Compile code that checks that subscripts are within array bounds.

-Dname=def
-Dname          Define the name to the C preprocessor, as if by ‘#define’. If no definition
                is given, the name is defined as "1". (.F files only).

-E str          Use the string str as an EFL option in processing .e files.

-F              Ratfor and EFL source programs are pre-processed into Fortran files,
                but those files are not compiled or removed.

-I dir          ‘#include’ files whose names do not begin with ‘/’ are always sought first
                in the directory of the file argument, then in directories named in -I
                options, then in directories on a standard list. (.F files only).

-O              Invoke the object code optimizer. Do not use with -g.

-Rstr           Use the string str as a Ratfor option in processing .r files.

-U              Do not convert upper case letters to lower case. The default is to convert
                Fortran programs to lower case except within character string constants.

-S              Generate assembler output for each source file, but do not assemble it.
                Assembler output for a source file x.f, x.F, x.e, x.r, or x.c is put on file
                x.s.

Other flags, all library names (arguments beginning -l), and any names not ending with
one of the understood suffixes are passed to the loader.


1.2. Documentation Conventions 

In running text, we write Fortran keywords and other literal strings in boldface lower case.
Examples will be presented in lightface lower case. Names representing a class of values
will be printed in italics.

1.3. Implementation Strategy

The compiler and library are written entirely in C. The compiler generates C compiler
intermediate code. Since there are C compilers running on a variety of machines,
relatively small changes will make this Fortran compiler generate code for any of them.
Furthermore, this approach guarantees that the resulting programs are compatible with C
usage. The runtime computational library is complete. The runtime I/O library makes
use of D. M. Ritchie’s Standard C I/O package [8] for transferring data. With the few
exceptions described below, only documented calls are used, so it should be relatively
easy to modify to run on other operating systems.

2. LANGUAGE EXTENSIONS 

Fortran 77 includes almost all of Fortran 66 as a subset. We describe the differences briefly in
Appendix A. The most important additions are a character string data type, file-oriented
input/output statements, and random access I/O. Also, the language has been cleaned up
considerably.

In addition to implementing the language specified in the new Standard, our compiler
implements a few extensions described in this section. Most are useful additions to the language.
The remainder are extensions to make it easier to communicate with C procedures or to permit
compilation of old (1966 Standard) programs.

2.1. Double Complex Data Type

The new type double complex is defined. Each datum is represented by a pair of double
precision real variables. A double complex version of every complex built-in function is
provided. The specific function names begin with z instead of c.

2.2. Internal Files 

The Fortran 77 standard introduces “internal files” (memory arrays), but restricts their 
use to formatted sequential I/O statements. Our I/O system also permits internal files to 
be used in formatted direct reads and writes. 

2.3. Implicit Undefined Statement 

Fortran 66 has a fixed rule that the type of a variable that does not appear in a type
statement is integer if its first letter is i, j, k, l, m or n, and real otherwise. Fortran 77 has an
implicit statement for overriding this rule. As an aid to good programming practice, we
permit an additional type, undefined. The statement

implicit undefined (a-z)

turns off the automatic data typing mechanism, and the compiler will issue a diagnostic
for each variable that is used but does not appear in a type statement. Specifying the -u
compiler flag is equivalent to beginning each procedure with this statement.
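
The default typing rule, and the effect of switching it off, can be modelled in a few lines. The Python below is a sketch of the rule as stated, not of the compiler; the function names default_type and undeclared are hypothetical.

```python
def default_type(name: str) -> str:
    """Type given to an undeclared variable under the fixed Fortran 66 rule."""
    return "integer" if name[0].lower() in "ijklmn" else "real"


def undeclared(used, declared):
    """Names that would draw a diagnostic under 'implicit undefined (a-z)'."""
    return sorted(set(used) - set(declared))


for name in ["i", "count", "n", "x"]:
    print(name, default_type(name))
print(undeclared(["i", "x", "total"], ["x"]))
```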

2.4. Recursion 

Procedures may call themselves, directly or through a chain of other procedures. 

2.5. Automatic Storage 

Two new keywords are recognized, static and automatic. These keywords may appear as 
“types” in type statements and in implicit statements. Local variables are static by 
default; there is exactly one copy of the datum, and its value is retained between calls. 
There is one copy of each variable declared automatic for each invocation of the
procedure. Automatic variables may not appear in equivalence, data, or save statements.

2.6. Source Input Format

The Standard expects input to the compiler to be in 72-column format: except in comment
lines, the first five characters are the statement number, the next is the continuation
character, and the next 66 are the body of the line. (If there are fewer than 72 characters 
on a line, the compiler pads it with blanks; characters after the seventy-second are 
ignored.) 

In order to make it easier to type Fortran programs, our compiler also accepts input in
variable length lines. An ampersand in the first position of a line indicates a
continuation line; the remaining characters form the body of the line. A tab character in one
of the first six positions of a line signals the end of the statement number and
continuation part of the line; the remaining characters form the body of the line. A tab elsewhere
on the line is treated as another kind of blank by the compiler.

In the Standard, there are only 26 letters — Fortran is a one-case language. Consistent 
with ordinary UNIX system usage, our compiler expects lower case input. By default, the 
compiler converts all upper case characters to lower case except those inside character 
constants. However, if the -U compiler flag is specified, upper case letters are not
transformed. In this mode, it is possible to specify external names with upper case letters 
in them, and to have distinct variables differing only in case. Regardless of the setting of 
the flag, keywords will only be recognized in lower case. 

2.7. Include Statement

The statement

include 'stuff'

is replaced by the contents of the file stuff; include statements may be nested to a
reasonable depth, currently ten.

2.8. Binary Initialization Constants 

A variable may be initialized in a data statement by a binary constant, denoted by a letter 
followed by a quoted string. If the letter is b, the string is binary, and only zeroes and
ones are permitted. If the letter is o, the string is octal, with digits 0-7. If the letter is z
or x, the string is hexadecimal, with digits 0-9, a-f. Thus, the statements

integer a(3)
data a / b'1010', o'12', z'a' /

initialize all three elements of a to ten.
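
The arithmetic behind these constants can be checked with a few lines of C. This is only a sketch, not part of the compiler: the function name radix_constant is hypothetical, and it leans on the C library routine strtol, which is more lenient than the rules above (it also accepts a sign and leading blanks).

```c
#include <stdlib.h>

/* Hypothetical helper: value of a constant such as b'1010', o'12', or z'a'.
   letter selects the base; digits is the text between the quotes.
   Returns -1 if the letter is unknown or a digit does not fit the base. */
long radix_constant(char letter, const char *digits)
{
    int base;
    char *end;
    long value;

    switch (letter) {
    case 'b': base = 2;  break;
    case 'o': base = 8;  break;
    case 'z':
    case 'x': base = 16; break;
    default:  return -1;
    }
    if (digits[0] == '\0')
        return -1;
    value = strtol(digits, &end, base);
    if (*end != '\0')          /* stopped early: an illegal digit */
        return -1;
    return value;
}
```

Called with ('b', "1010"), ('o', "12"), and ('z', "a"), the helper returns the same value each time, matching the data statement above.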

2.9. Character Strings 

For compatibility with C usage, the following backslash escapes are recognized: 

\n    newline
\t    tab
\b    backspace
\f    form feed
\0    null
\'    apostrophe (does not terminate a string)
\"    quotation mark (does not terminate a string)
\\    \
\x    x, where x is any other character

Fortran 77 only has one quoting character, the apostrophe. Our compiler and I/O system 
recognize both the apostrophe “ ' ” and the double-quote “ " ”. If a string begins with
one variety of quote mark, the other may be embedded within it without using the 
repeated quote or backslash escapes. 

Each character string constant appearing outside a data statement is followed by a null
character to ease communication with C routines. 
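
The escape rules listed in this section can be sketched in C as follows. The function name decode_escapes is hypothetical and the code is an illustration of the table, not the compiler's own scanner.

```c
#include <stddef.h>

/* Hypothetical helper: copy src to dst, translating the backslash escapes
   listed in this section. dst must have room for strlen(src) + 1 bytes.
   Returns the number of characters stored, not counting the final null. */
size_t decode_escapes(const char *src, char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        char c = *src++;
        if (c == '\\' && *src != '\0') {
            c = *src++;
            switch (c) {
            case 'n': c = '\n'; break;
            case 't': c = '\t'; break;
            case 'b': c = '\b'; break;
            case 'f': c = '\f'; break;
            case '0': c = '\0'; break;
            default:  break;   /* \', \", \\ and \x stand for themselves */
            }
        }
        dst[n++] = c;
    }
    dst[n] = '\0';             /* the trailing null that C routines expect */
    return n;
}
```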

2.10. Hollerith 

Fortran 77 does not have the old Hollerith “nh” notation, though the new Standard
recommends implementing the old Hollerith feature in order to improve compatibility
with old programs. In our compiler, Hollerith data may be used in place of character
string constants, and may also be used to initialize non-character variables in data
statements.

2.11. Equivalence Statements 

As a very special and peculiar case, Fortran 66 permits an element of a multiply- 
dimensioned array to be represented by a singly-subscripted reference in equivalence 
statements. Fortran 77 does not permit this usage, since subscript lower bounds may now 
be different from 1. Our compiler permits single subscripts in equivalence statements, 
under the interpretation that all missing subscripts are equal to 1. A warning message is 
printed for each such incomplete subscript. 

2.12. One-Trip DO Loops 

The Fortran 77 Standard requires that the range of a do loop not be performed if the
initial value is already past the limit value, as in

do 10 i = 2, 1

The 1966 Standard stated that the effect of such a statement was undefined, but it was
common practice that the range of a do loop would be performed at least once. In order
to accommodate old programs, though they were in violation of the 1966 Standard, the
-onetrip compiler flag causes non-standard loops to be generated.

2.13. Commas In Formatted Input 

The I/O system attempts to be more lenient than the Standard when it seems worthwhile. 
When doing a formatted read of non-character variables, commas may be used as value 
separators in the input record, overriding the field lengths given in the format statement. 
Thus, the format 

(i10, f20.10, i4)

will read the record

-345,.05e-3,12

correctly. 

2.14. Short Integers 

On machines that support halfword integers, the compiler accepts declarations of type
integer*2. (Ordinary integers follow the Fortran rules about occupying the same space as
a REAL variable; they are assumed to be of C type long int; halfword integers are of C
type short int.) An expression involving only objects of type integer*2 is of that type.
Generic functions return short or long integers depending on the actual types of their
arguments. If a procedure is compiled using the -I2 flag, all small integer constants will
be of type integer*2. If the precision of an integer-valued intrinsic function is not
determined by the generic function rules, one will be chosen that returns the prevailing length
(integer*2 when the -I2 command flag is in effect). When the -I2 option is in effect, all
quantities of type logical will be short. Note that these short integer and logical quantities
do not obey the standard rules for storage association.

2.15. Additional Intrinsic Functions

This compiler supports all of the intrinsic functions specified in the Fortran 77 Standard. 
In addition, there are functions for performing bitwise Boolean operations (or, and, xor, 
and not) and for accessing the UNIX command arguments (getarg and iargc) and
environment (getenv).

3. VIOLATIONS OF THE STANDARD 

We know only a few ways in which our Fortran system violates the new standard: 

3.1. Double Precision Alignment

The Fortran Standards (both 1966 and 1977) permit common or equivalence statements
to force a double precision quantity onto an odd word boundary, as in the following
example:

real a(4)
double precision b,c
equivalence (a(1),b), (a(4),c)

Some machines (e.g., Honeywell 6000, IBM 360) require that double precision quantities
be on double word boundaries; other machines (e.g., IBM 370) run inefficiently if this
alignment rule is not observed. It is possible to tell which equivalenced and common
variables suffer from a forced odd alignment, but every double precision argument would
have to be assumed on a bad boundary. To load such a quantity on some machines, it
would be necessary to use separate operations to move the upper and lower halves into
the halves of an aligned temporary, then to load that double precision temporary; the
reverse would be needed to store a result. We have chosen to require that all double
precision real and complex quantities fall on even word boundaries on machines with
corresponding hardware requirements, and to issue a diagnostic if the source code
demands a violation of the rule.

3.2. Dummy Procedure Arguments 

If any argument of a procedure is of type character, all dummy procedure arguments of 
that procedure must be declared in an external statement. This requirement arises as a 
subtle corollary of the way we represent character string arguments and of the one-pass 
nature of the compiler. A warning is printed if a dummy procedure is not declared
external. Code is correct if there are no character arguments.

3.3. T and TL Formats 

The implementation of the t (absolute tab) and tl (leftward tab) format codes is defective. 
These codes allow rereading or rewriting part of the record which has already been
processed (section 6.3.2 in Appendix A). The implementation uses seeks, so if the unit is
not one which allows seeks, such as a terminal, the program is in error. A benefit of the 
implementation chosen is that there is no upper limit on the length of a record, nor is it 
necessary to predeclare any record lengths except where specifically required by Fortran or 
the operating system. 

3.4. Carriage Control 

The Standard leaves as implementation dependent which logical unit(s) are treated as 
“printer” files. In this implementation there is no printer file and thus no carriage control 
is recognized on formatted output, except by special arrangement [9]. 

3.5. Assigned Goto

The optional list associated with an assigned goto statement is not checked against the 
actual assigned value during execution. 

4. INTER-PROCEDURE INTERFACE 

To be able to write C procedures that call or are called by Fortran procedures, it is necessary to 
know the conventions for procedure names, data representation, return values, and argument 
lists that the compiled code obeys. 

4.1. Procedure Names 

On UNIX systems, the name of a common block or a Fortran procedure has an underscore 
appended to it by the compiler to distinguish it from a C procedure or external variable 
with the same user-assigned name. Fortran library procedure names have embedded 
underscores to avoid clashes with user-assigned subroutine names. 

4.2. Data Representations 

The following is a table of corresponding Fortran and C declarations: 


Fortran                 C

integer*2 x             short int x;
integer x               long int x;
logical x               long int x;
real x                  float x;
double precision x      double x;
complex x               struct { float r, i; } x;
double complex x        struct { double dr, di; } x;
character*6 x           char x[6];


(By the rules of Fortran, integer, logical, and real data occupy the same amount of 
memory.) 

4.3. Return Values 

A function of type integer, logical, real, or double precision declared as a C function 
returns the corresponding type. A complex or double complex function is equivalent to a 
C routine with an additional initial argument that points to the place where the return 
value is to be stored. Thus, 

complex function f(...) 

is equivalent to 

f_(temp,...)
struct { float r, i; } *temp;
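
As an illustration of this convention, here is a sketch of a C routine that a Fortran program could use as a complex function. The name cmul_ and the struct tag fcomplex are hypothetical. The routine is declared void for brevity, whereas the declaration above gives no return type. The operands arrive by address, following the argument rules in section 4.4.

```c
/* Layout of a Fortran complex value, from the table in section 4.2. */
struct fcomplex { float r, i; };

/* Hypothetical routine: Fortran would declare it as
       complex function cmul(a, b)
   The extra result pointer comes first; a and b arrive by address. */
void cmul_(struct fcomplex *result,
           const struct fcomplex *a, const struct fcomplex *b)
{
    float re = a->r * b->r - a->i * b->i;
    float im = a->r * b->i + a->i * b->r;

    result->r = re;            /* store only after both parts are computed */
    result->i = im;
}
```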

A character-valued function is equivalent to a C routine with two extra initial arguments: 
a data address and a length. Thus, 

character*15 function g(...) 

is equivalent to 

g_(result, length, ...)
char result[ ];
long int length;

and could be invoked in C by 

char chars[15];
g_(chars, 15L,...);
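
A C routine written to be called as a character-valued function might look like the sketch below. The name greet_ and its fixed text are hypothetical. Padding with blanks is an assumption made here to mimic Fortran assignment, since the caller supplies exactly length characters of storage and no terminating null is expected.

```c
#include <string.h>

/* Hypothetical routine: Fortran would declare it as
       character*15 function greet()
   It fills exactly `length` characters; no null is stored. */
void greet_(char *result, long length)
{
    static const char text[] = "hello";
    long n = (long)strlen(text);
    long i;

    if (n > length)
        n = length;              /* truncate if the result is short */
    for (i = 0; i < n; i++)
        result[i] = text[i];
    for (; i < length; i++)      /* blank padding on the right */
        result[i] = ' ';
}
```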

Subroutines are invoked as if they were integer-valued functions whose value specifies 
which alternate return to use. Alternate return arguments (statement labels) are not 
passed to the function, but are used to do an indexed branch in the calling procedure. (If 
the subroutine has no entry points with alternate return arguments, the returned value is 
undefined.) The statement 

call nret(*1, *2, *3)

is treated exactly as if it were the computed goto

goto (1, 2, 3), nret()

4.4. Argument Lists 

All Fortran arguments are passed by address. In addition, for every argument that is of 
type character or that is a dummy procedure, an argument giving the length of the value 
is passed. (The string lengths are long int quantities passed by value.) The order of
arguments is then:

Extra arguments for complex and character functions 
Address for each datum or function 
A long int for each character or procedure argument 

Thus, the call in 

external f
character*7 s
integer b(3)

call sam(f, b(2), s)

is equivalent to that in

int f();
char s[7];
long int b[3];

sam_(f, &b[1], s, 0L, 7L);

Note that the first element of a C array always has subscript zero, but Fortran arrays begin 
at 1 by default. Fortran arrays are stored in column-major order, C arrays are stored in 
row-major order. 
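
The two differences just noted matter whenever a C routine indexes a Fortran array. The following sketch shows one way to compute the position of a(i,j); the names fortran_offset and getel_ are hypothetical, and the long arguments follow the integer entry of the table in section 4.2.

```c
#include <stddef.h>

/* Offset of Fortran element a(i,j) in an array declared a(m,n), counting
   from zero. Fortran subscripts i and j start at 1, and the first
   subscript varies fastest in column-major storage. */
size_t fortran_offset(size_t i, size_t j, size_t m)
{
    return (j - 1) * m + (i - 1);
}

/* Hypothetical routine: Fortran would call it as
       call getel(a, m, i, j, v)
   with a declared  real a(m,n).  Every argument arrives by address. */
void getel_(const float *a, const long *m, const long *i, const long *j,
            float *v)
{
    *v = a[fortran_offset((size_t)*i, (size_t)*j, (size_t)*m)];
}
```

Only the leading dimension m enters the calculation, which is why Fortran lets the last dimension of an array argument go unspecified.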

5. FILE FORMATS 

5.1. Structure of Fortran Files 

Fortran requires four kinds of external files: sequential formatted and unformatted, and 
direct formatted and unformatted. On UNIX systems, these are all implemented as
ordinary files which are assumed to have the proper internal structure.

Fortran I/O is based on records. When a direct file is opened in a Fortran program, the 
record length of the records must be given, and this is used by the Fortran I/O system to 
make the file look as if it is made up of records of the given length. In the special case 
that the record length is given as 1, the files are not considered to be divided into records, 
but are treated as byte-addressable byte strings; that is, as ordinary UNIX file system files. 
(A read or write request on such a file keeps consuming bytes until satisfied, rather than 
being restricted to a single record.)

The peculiar requirements on sequential unformatted files make it unlikely that they will 
ever be read or written by any means except Fortran I/O statements. Each record is
preceded and followed by an integer containing the record's length in bytes.
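
The record layout just described can be sketched in C. This is an illustration under stated assumptions, not the I/O library's code: the text says only "an integer", so the 4-byte length in host byte order is assumed here, and the names frame_record and record_length are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Build one sequential unformatted record in out: length, payload, length.
   Assumption: the length is a 4-byte integer in host byte order.
   out must hold len + 8 bytes. Returns the number of bytes used. */
size_t frame_record(const void *payload, int32_t len, unsigned char *out)
{
    memcpy(out, &len, sizeof len);
    memcpy(out + sizeof len, payload, (size_t)len);
    memcpy(out + sizeof len + (size_t)len, &len, sizeof len);
    return 2 * sizeof len + (size_t)len;
}

/* Read the payload length of the record that starts at rec. */
int32_t record_length(const unsigned char *rec)
{
    int32_t len;

    memcpy(&len, rec, sizeof len);
    return len;
}
```

A plausible reason for the trailing copy of the length is that it lets a reader step backward over a record without scanning from the start of the file.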

The Fortran I/O system breaks sequential formatted files into records while reading by 
using each newline as a record separator. The result of reading off the end of a record is 
undefined according to the Standard. The I/O system is permissive and treats the record 
as being extended by blanks. On output, the I/O system will write a newline at the end of 
each record. It is also possible for programs to write newlines for themselves. This is an 
error, but the only effect will be that the single record the user thought he wrote will be 
treated as more than one record when being read or backspaced over. 

5.2. Portability Considerations 

The Fortran I/O system uses only the facilities of the standard C I/O library, a widely 
available and fairly portable package, with the following two nonstandard features: the I/O 
system needs to know whether a file can be used for direct I/O, and whether or not it is 
possible to backspace. Both of these facilities are implemented using the fseek routine, so 
there is a routine canseek which determines if fseek will have the desired effect. Also, 
the inquire statement provides the user with the ability to find out if two files are the
same, and to get the name of an already opened file in a form which would enable the 
program to reopen it. Therefore there are two routines which depend on facilities of the 
operating system to provide these two services. In any case, the I/O system runs on the 
PDP-11, VAX-11/780, and Interdata 8/32 UNIX systems. 

5.3. Pre-Connected Files and File Positions 

Units 5, 6, and 0 are preconnected when the program starts. Unit 5 is connected to the 
standard input, unit 6 is connected to the standard output, and unit 0 is connected to the 
standard error unit. All are connected for sequential formatted I/O. 

All the other units are also preconnected when execution begins. Unit n is connected to 
a file named fort.n. These files need not exist, nor will they be created unless their units 
are used without first executing an open. The default connection is for sequential
formatted I/O.

The Standard does not specify where a file which has been explicitly opened for sequential 
I/O is initially positioned. The I/O system will position the file at the beginning.
Therefore a write will destroy any data already in the file, but a read will work reasonably. To
position a file to its end, use a 'read' loop, or the system dependent 'fseek' function. The
preconnected units 0, 5, and 6 are positioned as they come from the program’s parent 
process. 

APPENDIX A: Differences Between Fortran 66 and Fortran 77

The following is a very brief description of the differences between the 1966 [2] and the 1977 
[1] Standard languages. We assume that the reader is familiar with Fortran 66. We do not
pretend to be complete, precise, or unbiased, but plan to describe what we feel are the most
important aspects of the new language. The best current information on the 1977 Standard is 
in publications of the X3J3 Subcommittee of the American National Standards Institute, and the 
ANSI X3.9-1978 document, the official description of the language. The Standard is written in 
English rather than a meta-language, but it is forbidding and legalistic. A number of tutorials 
and textbooks are available (see Appendix B). 

1. Features Deleted from Fortran 66 

1.1. Hollerith 

All notions of “Hollerith” (nh) as data have been officially removed, although our
compiler, like almost all in the foreseeable future, will continue to support this archaism.

1.2. Extended Range 

In Fortran 66, under a set of very restrictive and rarely-understood conditions, it is
permissible to jump out of the range of a do loop, then jump back into it. Extended range
has been removed in the Fortran 77 language. The restrictions are so special, and the 
implementation of extended range is so unreliable in many compilers, that this change 
really counts as no loss. 

2. Program Form 

2.1. Blank Lines 

Completely blank lines are now legal comment lines. 

2.2. Program and Block Data Statements 

A main program may now begin with a statement that gives that program an external 
name: 

program work 

Block data procedures may also have names:

block data stuff

There is now a rule that only one unnamed block data procedure may appear in a
program. (This rule is not enforced by our system.) The Standard does not specify the effect
of the program and block data names, but they are clearly intended to aid conventional 
loaders. 

2.3. ENTRY Statement 

Multiple entry points are now legal. Subroutine and function subprograms may have
additional entry points, declared by an entry statement with an optional argument list.

entry extra (a, b, c) 

Execution begins at the first statement following the entry line. All variable declarations 
must precede all executable statements in the procedure. If the procedure begins with a 
subroutine statement, all entry points are subroutine names. If it begins with a function 
statement, each entry is a function entry point, with type determined by the type declared 
for the entry name. If any entry is a character-valued function, then all entries must be. 
In a function, an entry name of the same type as that where control entered must be 
assigned a value. Arguments do not retain their values between calls. (The ancient trick 
of calling one entry point with a large number of arguments to cause the procedure to
“remember” the locations of those arguments, then invoking an entry with just a few 
arguments for later calculation, is still illegal. Furthermore, the trick doesn’t work in our 
implementation, since arguments are not kept in static storage.) 

2.4. DO Loops 

do variables and range parameters may now be of integer, real, or double precision types. 
(The use of floating point do variables is very dangerous because of the possibility of 
unexpected roundoff, and we strongly recommend against their use.) The action of the do 
statement is now defined for all values of the do parameters. The statement 

do 10 i = l, u, d

performs max(0, ⌊(u-l+d)/d⌋) iterations. The do variable has a predictable value when
exiting a loop: the value at the time a goto or return terminates the loop; otherwise the 
value that failed the limit test. 
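
For integer parameters, the iteration count can be checked with a short C function. This is a sketch with a hypothetical name, do_trip_count. C integer division truncates toward zero rather than rounding down, but the two differ only when the quotient is negative, and the max with zero gives the same count in that case.

```c
/* Iteration count of  do 10 i = l, u, d  for integer parameters, following
   the formula above. The increment d must be nonzero. */
long do_trip_count(long l, long u, long d)
{
    long n = (u - l + d) / d;

    return n > 0 ? n : 0;
}
```

For the loop do 10 i = 2, 1 discussed in section 2.12 of the main text, the increment defaults to 1 and the function returns zero, so the range is not performed.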

2.5. Alternate Returns 

In a subroutine or subroutine entry statement, some of the arguments may be noted by 
an asterisk, as in 

subroutine s(a, *, b, *)

The meaning of the “alternate returns” is described in section 5.2 of Appendix A. 

3. Declarations 

3.1. CHARACTER Data Type 

One of the biggest improvements to the language is the addition of a character-string data 
type. Local and common character variables must have a length denoted by a constant 
expression: 

character*17 a, b(3,4)
character*(6+3) c 

If the length is omitted entirely, it is assumed equal to 1. A character string argument 
may have a constant length, or the length may be declared to be the same as that of the 
corresponding actual argument at run time by a statement like 

character*(*) a

(There is an intrinsic function len that returns the actual length of a character string.) 
Character arrays and common blocks containing character variables must be packed: in an 
array of character variables, the first character of one element must follow the last
character of the preceding element, without holes.

3.2. IMPLICIT Statement 

The traditional implied declaration rules still hold: a variable whose name begins with i, j,
k, l, m, or n is of type integer; other variables are of type real, unless otherwise declared.
This general rule may be overridden with an implicit statement: 

implicit real(a-c,g), complex(w-z), character*(17) (s)

declares that variables whose name begins with an a, b, c, or g are real, those beginning
with w, x, y, or z are assumed complex, and so on. It is still poor practice to depend on 
implicit typing, but this statement is an industry standard. 

3.3. PARAMETER Statement

It is now possible to give a constant a symbolic name, as in 

parameter (x=17, y=x/3, pi=3.14159d0, s='hello')

The type of each parameter name is governed by the same implicit and explicit rules as 
for a variable. The right side of each equal sign must be a constant expression (an 
expression made up of constants, operators, and already defined parameters). 

3.4. Array Declarations 

Arrays may now have as many as seven dimensions. (Only three were permitted in 
1966.) The lower bound of each dimension may be declared to be other than 1 by using a 
colon. Furthermore, an adjustable array bound may be an integer expression involving 
constants, arguments, and variables in common. 

real a(-5:3, 7, m:n), b(n+1:2*n)

The upper bound on the last dimension of an array argument may be denoted by an
asterisk to indicate that the upper bound is not specified:

integer a(5, *), b(*), c(0:1, -2:*)

3.5. SAVE Statement 

A poorly known rule of Fortran 66 is that local variables in a procedure do not necessarily 
retain their values between invocations of that procedure. At any instant in the execution 
of a program, if a common block is declared neither in the currently executing procedure 
nor in any of the procedures in the chain of callers, all of the variables in that common 
block also become undefined. (The only exceptions are variables that have been defined 
in a data statement and never changed.) These rules permit overlay and stack
implementations for the affected variables. Fortran 77 permits one to specify that certain variables
and common blocks are to retain their values between invocations. The declaration 

save a, /b/, c 

leaves the values of the variables a and c and all of the contents of common block b
unaffected by a return. The simple declaration 

save 

has this effect on all variables and common blocks in the procedure. A common block 
must be saved in every procedure in which it is declared if the desired effect is to occur. 

3.6. INTRINSIC Statement 

All of the functions specified in the Standard are in a single category, “intrinsic
functions”, rather than being divided into “intrinsic” and “basic external” functions. If an
intrinsic function is to be passed to another procedure, it must be declared intrinsic.
Declaring it external (as in Fortran 66) causes a function other than the built-in one to be 
passed. 

4. Expressions 

4.1. Character Constants 

Character string constants are marked by strings surrounded by apostrophes. If an
apostrophe is to be included in a constant, it is repeated:

'abc'
'ain''t'

There are no null (zero-length) character strings in Fortran 77. Our compiler has two
different quotation marks, “ ' ” and “ " ”. (See section 2.9 in the main text.) 

4.2. Concatenation 

One new operator has been added, character string concatenation, marked by a double 
slash “//”. The result of a concatenation is the string containing the characters of the
left operand followed by the characters of the right operand. The strings

'ab' // 'cd'
'abcd'

are equal. The strings being concatenated must be of constant length in all concatenations
that are not the right sides of assignments. (The only concatenation expressions in which
a character string declared adjustable with a “*(*)” modifier or a substring denotation
with nonconstant position values may appear are the right sides of assignments.) 

4.3. Character String Assignment 

The left and right sides of a character assignment may not share storage. (The assumed 
implementation of character assignment is to copy characters from the right to the left 
side.) If the left side is longer than the right, it is padded with blanks. If the left side is 
shorter than the right, trailing characters are discarded. 
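
The padding and truncation rules can be mirrored in C as follows. This is a sketch with a hypothetical name, char_assign; explicit lengths are passed because Fortran character values carry no terminating null.

```c
#include <string.h>

/* Hypothetical helper mirroring the Fortran character assignment dst = src.
   The two areas must not overlap, matching the rule stated above. */
void char_assign(char *dst, size_t dstlen, const char *src, size_t srclen)
{
    size_t n = srclen < dstlen ? srclen : dstlen;

    memcpy(dst, src, n);                 /* copy, dropping any excess */
    memset(dst + n, ' ', dstlen - n);    /* pad with blanks if src is short */
}
```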

4.4. Substrings 

It is possible to extract a substring of a character variable or character array element, using 
the colon notation: 

a(i,j) (m:n) 

is the string of (n-m+1) characters beginning at the m-th character of the character array
element a(i,j). Results are undefined unless m ≤ n. Substrings may be used on the left
sides of assignments and as procedure actual arguments. 

4.5. Exponentiation 

It is now permissible to raise real quantities to complex powers, or complex quantities to 
real or complex powers. (The principal part of the logarithm is used.) Also, multiple 
exponentiation is now defined: 

a**b**c is equivalent to a ** (b**c)


4.6. Relaxation of Restrictions 

Mixed mode expressions are now permitted. (For instance, it is permissible to combine 
integer and complex quantities in an expression.) 

Constant expressions are permitted where a constant is allowed, except in data
statements. (A constant expression is made up of explicit constants and parameters and the
Fortran operators, except for exponentiation to a floating-point power.) An adjustable
dimension may now be an integer expression involving constants, arguments, and
variables in common.

Subscripts may now be general integer expressions; the old cv±c' rules have been
removed. do loop bounds may be general integer, real, or double precision expressions.
Computed goto expressions and I/O unit numbers may be general integer expressions. 

5. Executable Statements

5.1. IF-THEN-ELSE 

At last, the if-then-else branching structure has been added to Fortran. It is called a
“Block If”. A Block If begins with a statement of the form

if (...) then 

and ends with an 

end if 

statement. Two other new statements may appear in a Block If. There may be several

else if (...) then

statements, followed by at most one

else

statement. If the logical expression in the Block If statement is true, the statements
following it up to the next else if, else, or end if are executed. Otherwise, the next else if
statement in the group is executed. If none of the else if conditions are true, control
passes to the statements following the else statement, if any. (The else block must follow
all else if blocks in a Block If. Of course, there may be Block Ifs embedded inside of
other Block If structures.) A case construct may be rendered:

if (s .eq. 'ab') then 
else if (s .eq. 'cd') then 
else 
end if 

5.2. Alternate Returns 

Some of the arguments of a subroutine call may be statement labels preceded by an
asterisk, as in:

call joe(j, *10, m, *2)

A return statement may have an integer expression, such as:

return k

If the entry point has n alternate return (asterisk) arguments and if 1 ≤ k ≤ n, the return
is followed by a branch to the corresponding statement label; otherwise the usual return to
the statement following the call is executed.

6. Input/Output 

6.1. Format Variables 

A format may be the value of a character expression (constant or otherwise), or be stored 
in a character array, as in: 

write(6, '(i5)') x

6.2. END=, ERR=, and IOSTAT= Clauses

A read or write statement may contain end=, err=, and iostat= clauses, as in:

write(6, 101, err=20, iostat=a(4))
read(5, 101, err=20, end=30, iostat=x)

Here 5 and 6 are the units on which the I/O is done, 101 is the statement number of the
associated format, 20 and 30 are statement numbers, and a and x are integers. If an error
occurs during I/O, control returns to the program at statement 20. If the end of the file is
reached, control returns to the program at statement 30. In any case, the variable
referred to in the iostat= clause is given a value when the I/O statement finishes. (Yes,
the value is assigned to the name on the right side of the equal sign.) This value is zero if 
all went well, negative for end of file, and some positive value for errors. 

6.3. Formatted I/O 

6.3.1. Character Constants 

Character constants in formats are copied literally to the output. It is not allowed to read 
into character constants or hollerith fields. 

A format may be specified as a character constant within the read or write statement. 

write(6,'(i2,'' isn''''t '',i1)') 7, 4

produces

7 isn't 4

In the example above, the format is the character constant

(i2,' isn''t ',i1)

and the embedded character constant

 isn't

is copied into the output.

The example could have been written more legibly by taking advantage of the two types
of quote marks.

write(6,'(i2," isn''t ",i1)') 7, 4

However, the double quote is not standard Fortran 77. 

6.3.2. Positional Editing Codes 

t, tl, tr, and x codes control where the next character is in the record. trn or nx specifies
that the next character is n to the right of the current position. tln specifies that the next
character is n to the left of the current position, allowing parts of the record to be
reconsidered. tn says that the next character is to be character number n in the record. (See
section 3.3 in the main text.) 

6.3.3. Colon 

A colon in the format terminates the I/O operation if there are no more data items in the 
I/O list, otherwise it has no effect. In the fragment 

x='("hello",:," there",i4)'
write (6, x) 12 
write (6, x) 

the first write statement prints 

hello there 12

while the second only prints

hello

6.3.4. Optional Plus Signs

According to the Standard, each implementation has the option of putting plus signs in 
front of non-negative numeric output. The sp format code may be used to make the 
optional plus signs actually appear for all subsequent items while the format is active. The 
ss format code guarantees that the I/O system will not insert the optional plus signs, and 
the s format code restores the default behavior of the I/O system. (Since we never put 
out optional plus signs, ss and s codes have the same effect in our implementation.) 

6.3.5. Blanks on Input

Blanks in numeric input fields, other than leading blanks, will be ignored following a bn 
code in a format statement, and will be treated as zeros following a bz code in a format 
statement. The default for a unit may be changed by using the open statement. (Blanks 
are ignored by default.) 
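
The difference between the two codes can be seen by reading the same field twice. This is a sketch only; the string s and the variable names are assumptions for the example.

```fortran
      program blanks
      character*4 s
      integer i, j
      s = '1 2 '
c     bn: embedded and trailing blanks are ignored, so i is 12.
      read(s,'(bn,i4)') i
c     bz: the same blanks count as zeros, so j is 1020.
      read(s,'(bz,i4)') j
      write(6,'(2i6)') i, j
      end
```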

6.3.6. Unrepresentable Values 

The Standard requires that if a numeric item cannot be represented in the form required 
by a format code, the output field must be filled with asterisks. (We think this should 
have been an option.) 

6.3.7. Iw.m 

There is a new integer output code, Iw.m. It is the same as Iw, except that there will be at
least m digits in the output field, including, if necessary, leading zeros. The case Iw.0 is
special, in that if the value being printed is 0, the output field is entirely blank. Iw.1 is
the same as Iw.
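
The three cases described above can be tried with a few lines; this is an illustrative sketch, not part of the original text.

```fortran
      program iwm
c     At least three digits: the field holds '  007'.
      write(6,'(i5.3)') 7
c     Zero printed with m = 0 gives an entirely blank field.
      write(6,'(i5.0)') 0
c     m = 1 behaves like plain i5.
      write(6,'(i5.1)') 7
      end
```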

6.3.8. Floating Point 

On input, exponents may start with the letter E, D, e, or d. All have the same meaning.
On output we always use e or d. The e and d format codes also have identical meanings.
A leading zero before the decimal point in e output without a scale factor is optional with
the implementation. There is a gw.d format code which is the same as ew.d and fw.d on
input, but which chooses f or e formats for output depending on the size of the number
and of d.

6.3.9. “A” Format Code 

The a code is used for character data. aw uses a field width of w, while a plain a uses the
length of the internal character item.

6.4. Standard Units 

There are default formatted input and output units. The statement 
read 10, a, b 

reads from the standard unit using format statement 10. The default unit may be
explicitly specified by an asterisk, as in

read(*, 10) a,b 

Similarly, the standard output unit is specified by a print statement or by an asterisk unit
in a write statement.
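
A minimal sketch of the default units in use is given here. The format labels 10 and 20 and the variables a and b are chosen only for illustration.

```fortran
      program stdio
      real a, b
c     Both statements below use the standard input unit.
      read(*,10) a, b
c     print and write(*,...) both go to the standard output unit.
      print 20, a, b
      write(*,20) a, b
 10   format(2f10.0)
 20   format(1x,2f10.3)
      end
```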

6.5. List-Directed Formatting

List-directed I/O is a kind of free form input for sequential I/O. It is invoked by using an 
asterisk as the format identifier, as in 

read(6, *) a,b,c

On input, values are separated by strings of blanks and possibly a comma. Values, except 
for character strings, cannot contain blanks. End of record counts as a blank, except in 
character strings, where it is ignored. Complex constants are given as two real constants 
separated by a comma and enclosed in parentheses. A null input field, such as between
two consecutive commas, means the corresponding variable in the I/O list is not changed. 
Values may be preceded by repetition counts, as in 

4*(3.,2.) 2*, 4*'hello'

which stands for 4 complex constants, 2 null values, and 4 string constants. 

For output, suitable formats are chosen for each item. The values of character strings are 
printed; they are not enclosed in quotes, so they cannot be read back using list-directed 
input. 
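
For example, the following sketch reads an integer, a real, and a complex value in free form and echoes them. The variable names are assumptions; an input line such as 3, 2.5 (1.,2.) would satisfy the read.

```fortran
      program listio
      integer i
      real x
      complex z
c     Values may be separated by blanks or by a comma.
      read(*,*) i, x, z
c     The I/O system picks a suitable format for each item.
      print *, i, x, z
      end
```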

6.6. Direct I/O 

A file connected for direct access consists of a set of equal-sized records each of which is 
uniquely identified by a positive integer. The records may be written or read in any order, 
using direct access I/O statements. 

Direct access read and write statements have an extra argument, rec=, which gives the
record number to be read or written. 

read(2, rec=13, err=20) (a(i), i=1, 203)

reads the thirteenth record into the array a. 

The size of the records must be given by an open statement (see below). Direct access 
files may be connected for either formatted or unformatted I/O. 
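
A small self-contained sketch of direct access follows. The file name direct.dat, the unit number, and the record length of 16 bytes (four 4-byte reals) are assumptions made for the example.

```fortran
      program direct
      integer i
      real a(4), b(4)
      do 10 i = 1, 4
         a(i) = real(i)
 10   continue
c     recl is given in bytes here: four reals of 4 bytes each.
      open(2, file='direct.dat', access='direct', recl=16,
     &     form='unformatted', err=20)
c     Records may be written and read in any order.
      write(2, rec=3, err=20) (a(i), i=1, 4)
      read(2, rec=3, err=20) (b(i), i=1, 4)
      write(6,*) b
      close(2, status='delete')
      stop
 20   write(6,*) 'direct access I/O failed'
      end
```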

6.7. Internal Files

Internal files are character string objects, such as variables or substrings, or arrays of type 
character. In the former cases there is only a single record in the file; in the latter case 
each array element is a record. The Standard includes only sequential formatted I/O on 
internal files. (I/O is not a very precise term to use here, but internal files are dealt with 
using read and write.) There is no list-directed I/O on internal files. Internal files are 
used by giving the name of the character object in place of the unit number, as in 

character*80 x 
read(5,'(a)') x 
read(x,'(i3,i4)') n1,n2

which reads a character string into x and then reads two integers from the front of it. A 
sequential read or write always starts at the beginning of an internal file. 

We also support a compatible extension, direct I/O on internal files. This is like direct I/O 
on external files, except that the number of records in the file cannot be changed. In this 
case a record is a single element of an array of character strings. 
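
Internal files are just as useful on output. The sketch below builds a file name in a character variable; the names name and n and the 'run' prefix are invented for this illustration.

```fortran
      program intwr
      character*12 name
      integer n
      n = 7
c     The formatted write fills the character variable, which
c     then holds 'run007.dat' followed by blanks.
      write(name,'(a,i3.3,a)') 'run', n, '.dat'
      write(6,'(a)') name
      end
```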

6.8. OPEN, CLOSE, and INQUIRE Statements

These statements are used to connect and disconnect units and files, and to gather
information about units and files.

6.8.1. OPEN 

The open statement is used to connect a file with a unit, or to alter some properties of the 
connection. The following is a minimal example. 

open(1, file='fort.junk')

open takes a variety of arguments with meanings described below. 

unit= a small non-negative integer which is the unit to which the file is to be connected.
We allow, at the time of this writing, 0 through 19. If this parameter is the first one
in the open statement, the unit= can be omitted.

iostat= is the same as in read or write.

err= is the same as in read or write.

file= a character expression, which when stripped of trailing blanks, is the name of the
file to be connected to the unit. The filename should not be given if
status='scratch'.

status= one of 'old', 'new', 'scratch', or 'unknown'. If this parameter is not given,
'unknown' is assumed. The meaning of 'unknown' is processor dependent; our system
will create the file if it doesn’t exist. If 'scratch' is given, a temporary file will
be created. Temporary files are destroyed at the end of execution. If 'new' is given,
the file must not exist. It will be created for both reading and writing. If 'old' is
given, it is an error for the file not to exist.

access= 'sequential' or 'direct', depending on whether the file is to be opened for
sequential or direct I/O.

form= 'formatted' or 'unformatted'. On UNIX systems form='print' implies 'formatted'
with vertical format control.

recl= a positive integer specifying the record length of the direct access file being opened.
We measure all record lengths in bytes. On UNIX systems a record length of 1 has
the special meaning explained in section 5.1 of the text.

blank= 'null' or 'zero'. This parameter has meaning only for formatted I/O. The default
value is 'null'. 'zero' means that blanks, other than leading blanks, in numeric input
fields are to be treated as zeros.

Opening a new file on a unit which is already connected has the effect of first closing the 
old file. 
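
The following sketch combines several of these parameters. The file name input.dat and the unit numbers are assumptions made for the example.

```fortran
      program opens
      integer ios
c     status='old' requires the file to exist; if it does not,
c     ios is set nonzero instead of the program aborting.
      open(1, file='input.dat', status='old', iostat=ios)
      if (ios .ne. 0) write(6,*) 'open failed, iostat =', ios
c     A scratch file takes no name and is removed when closed
c     or when the program ends.
      open(2, status='scratch', form='unformatted')
      write(2) 3.14
      rewind 2
      close(2)
      if (ios .eq. 0) close(1)
      end
```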

6.8.2. CLOSE 

close severs the connection between a unit and a file. The unit number must be given.
The optional parameters are iostat= and err= with their usual meanings, and status=
either 'keep' or 'delete'. For scratch files the default is 'delete'; otherwise 'keep' is the
default. 'delete' means the file will be removed. A simple example is

close(3, err=17)

6.8.3. INQUIRE 

The inquire statement gives information about a unit (“inquire by unit”) or a file
(“inquire by file”). Simple examples are:

inquire(unit=3, name=xx)

inquire(file='junk', number=n, exist=l)


file= a character variable specifies the file the inquire is about. Trailing blanks in the file
name are ignored.

unit= an integer variable specifies the unit the inquire is about. Exactly one of file= or
unit= must be used.

iostat=, err= are as before.

exist= a logical variable. The logical variable is set to .true. if the file or unit exists and
is set to .false. otherwise.

opened= a logical variable. The logical variable is set to .true. if the file is connected to
a unit or if the unit is connected to a file, and it is set to .false. otherwise.

number= an integer variable to which is assigned the number of the unit connected to
the file, if any.

named= a logical variable to which is assigned .true. if the file has a name, or .false.
otherwise.

name= a character variable to which is assigned the name of the file (inquire by file) or
the name of the file connected to the unit (inquire by unit). The name will be the
full name of the file.

access= a character variable to which will be assigned the value 'sequential' if the
connection is for sequential I/O, 'direct' if the connection is for direct I/O. The value
becomes undefined if there is no connection.

sequential= a character variable to which is assigned the value 'yes' if the file could be
connected for sequential I/O, 'no' if the file could not be connected for sequential
I/O, and 'unknown' if we can’t tell.

direct= a character variable to which is assigned the value 'yes' if the file could be
connected for direct I/O, 'no' if the file could not be connected for direct I/O, and
'unknown' if we can’t tell.

form= a character variable to which is assigned the value 'unformatted' if the file is
connected for unformatted I/O, 'formatted' if the file is connected for formatted I/O, or
'print' for formatted I/O with vertical format control.

formatted= a character variable to which is assigned the value 'yes' if the file could be
connected for formatted I/O, 'no' if the file could not be connected for formatted
I/O, and 'unknown' if we can’t tell.

unformatted= a character variable to which is assigned the value 'yes' if the file could be
connected for unformatted I/O, 'no' if the file could not be connected for
unformatted I/O, and 'unknown' if we can’t tell.

recl= an integer variable to which is assigned the record length of the records in the file
if the file is connected for direct access.

nextrec= an integer variable to which is assigned one more than the number of the
last record read from a file connected for direct access.

blank= a character variable to which is assigned the value 'null' if null blank control is in
effect for the file connected for formatted I/O, 'zero' if blanks are being converted to
zeros and the file is connected for formatted I/O.

The gentle reader will remember that the people who wrote the Standard probably weren’t
thinking of his needs. Here is an example. The declarations are omitted.

open(1, file='/dev/console')

On a UNIX system this statement opens the console for formatted sequential I/O. An
inquire statement for either unit 1 or file "/dev/console" would reveal that the file exists,
is connected to unit 1, has a name, namely "/dev/console", is opened for sequential I/O,
could be connected for sequential I/O, could not be connected for direct I/O (can’t seek),
is connected for formatted I/O, could be connected for formatted I/O, could not be
connected for unformatted I/O (can’t seek), has neither a record length nor a next record
number, and is ignoring blanks in numeric fields.
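
A sketch of an inquire by unit, with the declarations written out, might look like this. It uses an ordinary file, fort.junk, rather than the console, and the variable names are assumptions for the example.

```fortran
      program inq
      logical ex, op
      integer n
      character*64 nm
      character*12 acc, frm
      open(1, file='fort.junk')
c     Ask about the unit; each keyword fills one variable.
      inquire(unit=1, exist=ex, opened=op, number=n, name=nm,
     &        access=acc, form=frm)
      write(6,*) 'exists: ', ex, ' opened: ', op, ' unit: ', n
      write(6,*) 'name:   ', nm
      write(6,*) 'access: ', acc, ' form: ', frm
      close(1, status='delete')
      end
```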

In the FORTRAN environment, the only way to discover what permissions you have for a
file is to open it and try to read and write it. The err= parameter will return system error
numbers. The inquire statement does not give a way of determining permissions.

For further discussion of the UNIX Fortran I/O system see “Introduction to the f77 I/O
Library” [9].

Introduction to the f77 I/O Library

David L. Wasley 

University of California, Berkeley 
Berkeley, California 94720 


The f77 I/O library, libI77.a, includes routines to perform all of the standard types of 
FORTRAN input and output. Several enhancements and extensions to FORTRAN I/O have 
been added. The f77 library routines use the C stdio library routines to provide efficient 
buffering for file I/O. 


1. FORTRAN I/O 

The requirements of the ANSI standard impose significant overhead on programs that do 
large amounts of I/O. Formatted I/O can be very “expensive” while direct access binary I/O is 
usually very efficient. Because of the complexity of FORTRAN I/O, some general concepts 
deserve clarification. 

1.1. Types of I/O 

There are three forms of I/O: formatted, unformatted, and list-directed. The last
is related to formatted but does not obey all the rules for formatted I/O. There are two
modes of access to external and internal files: direct and sequential. The definition of a
logical record depends upon the combination of I/O form and mode specified by the
FORTRAN I/O statement.

1.1.1. Direct access 

A logical record in a direct access external file is a string of bytes of a length specified 
when the file is opened. Read and write statements must not specify logical records longer 
than the original record size definition. Shorter logical records are allowed. Unformatted 
direct writes leave the unfilled part of the record undefined. Formatted direct writes cause 
the unfilled record to be padded with blanks. 

1.1.2. Sequential access 

Logical records in sequentially accessed external files may be of arbitrary and variable
length. Logical record length for unformatted sequential files is determined by the size
of items in the iolist. The requirements of this form of I/O cause the external physical record
size to be somewhat larger than the logical record size. For formatted write statements,
logical record length is determined by the format statement interacting with the iolist at
execution time. The “newline” character is the logical record delimiter. Formatted sequential
access causes one or more logical records ending with “newline” characters to be read or
written.

1.1.3. List directed I/O 

Logical record length for list-directed I/O is relatively meaningless. On output, the 
record length is dependent on the magnitude of the data items. On input, the record length is 
determined by the data types and the file contents. 

1.1.4. Internal I/O

The logical record length for an internal read or write is the length of the character
variable or array element. Thus a simple character variable is a single logical record. A
character variable array is similar to a fixed length direct access file, and obeys the same rules.
Unformatted I/O is not allowed on “internal” files.

1.2. I/O execution 

Note that each execution of a FORTRAN unformatted I/O statement causes a single 
logical record to be read or written. Each execution of a FORTRAN formatted I/O statement 
causes one or more logical records to be read or written. 

A slash, “/”, will terminate assignment of values to the input list during list-directed 
input and the remainder of the current input line is skipped. The standard is rather vague on 
this point but seems to require that a new external logical record be found at the start of any 
formatted input. Therefore data following the slash is ignored and may be used to comment 
the data file. 

Direct access list-directed I/O is not allowed. Unformatted internal I/O is not 
allowed. Both the above will be caught by the compiler. All other flavors of I/O are allowed, 
although some are not part of the ANSI standard. 

Any error detected during I/O processing will cause the program to abort unless alternative
action has been provided specifically in the program. Any I/O statement may include an
err= clause (and iostat= clause) to specify an alternative branch to be taken on errors (and
return the specific error code). Read statements may include end= to branch on end-of-file.
File position and the values of I/O list items are undefined following an error.
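
The clauses can be combined in one statement. The loop below is a sketch that sums integers from unit 5 until end-of-file; the labels and variable names are chosen for illustration only.

```fortran
      program rdloop
      integer n, ios, total
      total = 0
c     end= leaves the loop at end-of-file; err= catches any
c     other failure and iostat= records its error number.
 10   read(5, '(i10)', iostat=ios, err=30, end=20) n
      total = total + n
      goto 10
 20   write(6,*) 'end of file, total =', total
      stop
 30   write(6,*) 'read error, iostat =', ios
      end
```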


2. Implementation details 

Some details of the current implementation may be useful in understanding constraints 
on FORTRAN I/O. 

2.1. Number of logical units 

The maximum number of logical units that a program may have open at one time is the
same as the UNIX† system limit, currently 20. Unit numbers must be in the range 0-19
because they are used to index an internal control table.

2.2. Standard logical units 

By default, logical units 0, 5, and 6 are opened to “stderr”, “stdin”, and “stdout”
respectively. However they can be re-defined with an open statement. To preserve error reporting,
it is an error to close logical unit 0 although it may be reopened to another file.

If you want to open the default file name for any preconnected logical unit, remember to
close the unit first. Redefining the standard units may impair normal console I/O. An
alternative is to use shell re-direction to externally re-define the above units. To re-define default
blank control or format of the standard input or output files, use the open statement
specifying the unit number and no file name (see § 2.4).

The standard units, 0, 5, and 6, are named internally “stderr”, “stdin”, and “stdout” 
respectively. These are not actual file names and can not be used for opening these units. 
Inquire will not return these names and will indicate that the above units are not named 
unless they have been opened to real files. The names are meant to make error reporting 
more meaningful. 


† UNIX is a trademark of Bell Laboratories.

2.3. Vertical format control

Simple vertical format control is implemented. The logical unit must be opened for
sequential access with form='print' (see §3.2). Control codes “0” and “1” are replaced in
the output file with “\n” and “\f” respectively. The control character “+” is not implemented
and, like any other character in the first position of a record written to a “print” file, is
dropped. No vertical format control is recognized for direct formatted output or list
directed output.

2.4. The open statement 

An open statement need not specify a file name. If it refers to a logical unit that is
already open, the blank= and form= specifiers may be redefined without affecting the
current file position. Otherwise, if status='scratch' is specified, a temporary file with a
name of the form “tmp.FXXXXXX” will be opened, and, by default, will be deleted when
closed or during termination of program execution. Any other status= specifier without an
associated file name results in opening a file named “fort.N” where N is the specified logical
unit number.

It is an error to try to open an existing file with status='new'. It is an error to try to
open a nonexistent file with status='old'. By default, status='unknown' will be
assumed, and a file will be created if necessary.

By default, files are positioned at their beginning upon opening, but see ioinit(3f) for
alternatives. Existing files are never truncated on opening. Sequentially accessed external
files are truncated to the current file position on close, backspace, or rewind only if the
last access to the file was a write. An endfile always causes such files to be truncated to the
current file position.

2.5. Format interpretation 

Formats are parsed at the beginning of each execution of a formatted I/O statement.
Upper as well as lower case characters are recognized in format statements and all the
alphabetic arguments to the I/O library routines.

If the external representation of a datum is too large for the field width specified, the 
specified field is filled with asterisks (*). On Ew.dEe output, the exponent field will be filled 
with asterisks if the exponent representation is too large. This will only happen if “e” is zero 
(see appendix B). 

On output, a real value that is truly zero will display as “0.” to distinguish it from a very
small non-zero value. This occurs in F and G format conversions. This was not done for E
and D since the embedded blanks in the external datum cause problems for other input
systems.

Non-destructive tabbing is implemented for both internal and external formatted I/O.
Tabbing left or right on output does not affect previously written portions of a record.
Tabbing right on output causes unwritten portions of a record to be filled with blanks. Tabbing
right off the end of an input logical record is an error. Tabbing left beyond the beginning of
an input logical record leaves the input pointer at the beginning of the record. The format
specifier T must be followed by a positive non-zero number. If it is not, it will have a
different meaning (see §3.1).

Tabbing left requires seek ability on the logical unit. Therefore it is not allowed in I/O
to a terminal or pipe. Likewise, nondestructive tabbing in either direction is possible only on
a unit that can seek. Otherwise tabbing right or spacing with X will write blanks on the
output.

2.6. List directed output

In formatting list directed output, the I/O system tries to prevent output lines longer
than 80 characters. Each external datum will be separated by two spaces. List-directed
output of complex values includes an appropriate comma. List-directed output distinguishes
between real and double precision values and formats them differently. Output of a
character string that includes “\n” is interpreted reasonably by the output system.

2.7. I/O errors 

If I/O errors are not trapped by the user’s program an appropriate error message will be
written to “stderr” before aborting. An error number will be printed in [ ] along with a brief
error message showing the logical unit and I/O state. Error numbers < 100 refer to UNIX
errors, and are described in the introduction to chapter 2 of the UNIX Programmer’s Manual.
Error numbers ≥ 100 come from the I/O library, and are described further in the appendix to
this writeup. For internal I/O, part of the string will be printed with “|” at the current
position in the string. For external I/O, part of the current record will be displayed if the error
was caused during reading from a file that can backspace.


3. Non-“ANSI Standard” extensions 

Several extensions have been added to the I/O system to provide for functions omitted 
or poorly defined in the standard. Programmers should be aware that these are non-portable. 

3.1. Format specifiers 

B is an acceptable edit control specifier. It causes return to the default mode of blank 
interpretation. This is consistent with S which returns to default sign control. 

P by itself is equivalent to 0P. It resets the scale factor to the default value, 0.

The form of the Ew.dEe format specifier has been extended to D also. The form Ew.d.e 
is allowed but is not standard. The “e” field specifies the minimum number of digits or 
spaces in the exponent field on output. If the value of the exponent is too large, the exponent 
notation e or d will be dropped from the output to allow one more character position. If this 
is still not adequate, the “e” field will be filled with asterisks (*). The default value for “e” is
2.

An additional form of tab control specification has been added. The ANSI standard 
forms TRn, TLn, and Tn are supported where n is a positive non-zero number. If T or nT is 
specified, tabbing will be to the next (or n-th) 8-column tab stop. Thus columns of 
alphanumerics can be lined up without counting. 

A format control specifier has been added to suppress the newline at the end of the last 
record of a formatted sequential write. The specifier is a dollar sign ($). It is constrained by 
the same rules as the colon (:). It is used typically for console prompts. For example: 

write (*, "('enter value for x: ',$)")
read (*,*) x 

Radices other than 10 can be specified for formatted integer I/O conversion. The 
specifier is patterned after P, the scale factor for floating point conversion. It remains in effect 
until another radix is specified or format interpretation is complete. The specifier is defined as 
[n]R where 2 ≤ n ≤ 36. If n is omitted, the default decimal radix is restored.

In conjunction with the above, a sign control specifier has been added to cause integer 
values to be interpreted as unsigned during output conversion. The specifier is SU and 
remains in effect until another sign control specifier is encountered, or format interpretation is 
complete. Radix and “unsigned” specifiers could be used to format a hexadecimal dump, as 
follows: 

2000 format ( SU, 16R, 8I10.8 )


Note: Unsigned integer values greater than (2**31 - 1), i.e. any signed negative value, can not
be read by FORTRAN input routines. All internal values will be output correctly.

3.2. Print files 

The ANSI standard is ambiguous regarding the definition of a “print” file. Since UNIX
has no default “print” file, an additional form= specifier is now recognized in the open
statement. Specifying form='print' implies formatted and enables vertical format control for
that logical unit. Vertical format control is interpreted only on sequential formatted writes to
a “print” file.

The inquire statement will return 'print' in the form= string variable for logical units
opened as “print” files. It will return -1 for the unit number of an unconnected file.

If a logical unit is already open, an open statement including the form= option or the
blank= option will do nothing but re-define those options. This instance of the open
statement need not include the file name, and must not include a file name if unit= refers to a
standard input or output. Therefore, to re-define the standard output as a “print” file, use:

open (unit=6, form='print')

3.3. Scratch files 

A close statement with status='keep' may be specified for temporary files. This is
the default for all other files. Remember to get the scratch file’s real name, using inquire, if
you want to re-open it later.

3.4. List directed I/O 

List directed read has been modified to allow input of a string not enclosed in quotes.
The string must not start with a digit, and can not contain a separator (, or /) or blank (space
or tab). A newline will terminate the string unless escaped with \. Any string not meeting the
above restrictions must be enclosed in quotes (" or ').

Internal list-directed I/O has been implemented. During internal list reads, bytes are
consumed until the iolist is satisfied, or the ’end-of-file’ is reached. During internal list writes,
records are filled until the iolist is satisfied. The length of an internal array element should be
at least 20 bytes to avoid logical record overflow when writing double precision values.
Internal list read was implemented to make command line decoding easier. Internal list write
should be avoided.
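
As a sketch of the command line decoding use, the fragment below pulls an integer and a real out of a character string with an internal list-directed read. The string contents and variable names are assumptions for the example, and the read itself relies on the extension described above rather than on Fortran 77 proper.

```fortran
      program argdec
      character*40 arg
      integer n
      real tol
c     Stand-in for an argument string obtained elsewhere.
      arg = '25 1.0e-3'
c     Internal list-directed read: an extension, not standard
c     Fortran 77.
      read(arg, *) n, tol
      write(6,*) n, tol
      end
```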


4. Running older programs 

Traditional FORTRAN environments usually assume carriage control on all logical units, 
usually interpret blank spaces on input as “0”s, and often provide attachment of global file 
names to logical units at run time. There are several routines in the I/O library to provide 
these functions. 

4.1. Traditional unit control parameters 

If a program reads and writes only units 5 and 6, then including -lI66 in the f77
command will cause carriage control to be interpreted on output and cause blanks to be zeros on
input without further modification of the program. If this is not adequate, the routine
ioinit(3f) can be called to specify control parameters separately, including whether files should
be positioned at their beginning or end upon opening.

4.2. Preattachment of logical units

The ioinit routine also can be used to attach logical units to specific files at run time. It 
will look for names of a user specified form in the environment and open the corresponding 
logical unit for sequential formatted I/O. Names must be of the form PREFIXnn where 
PREFIX is specified in the call to ioinit and nn is the logical unit to be opened. Unit 
numbers < 10 must include the leading “0”. 

Ioinit should prove adequate for most programs as written. However, it is written in
FORTRAN-77 specifically so that it may serve as an example for similar user-supplied
routines. A copy may be retrieved by “ar x /usr/lib/libI77.a ioinit.f”.


5. Magnetic tape I/O 

Because the I/O library uses stdio buffering, reading or writing magnetic tapes should be 
done with great caution, or avoided if possible. A set of routines has been provided to read 
and write arbitrary sized buffers to or from tape directly. The buffer must be a character 
object. Internal I/O can be used to fill or interpret the buffer. These routines do not use 
normal FORTRAN I/O processing and do not obey FORTRAN I/O rules. See tapeio(3f).


6. Caveat Programmer 

The I/O library is extremely complex yet we believe there are few bugs left. We’ve tried
to make the system as correct as possible according to the ANSI X3.9-1978 document and
keep it compatible with the UNIX file system. Exceptions to the standard are noted in
appendix B.

Appendix A

I/O Library Error Messages 

The following error messages are generated by the I/O library. The error numbers are
returned in the iostat= variable if the err= return is taken. Error numbers < 100 are
generated by the UNIX kernel. See the introduction to chapter 2 of the UNIX Programmer’s
Manual for their description.

/* 100 */ "error in format"

See error message output for the location
of the error in the format. Can be caused
by more than 10 levels of nested (), or
an extremely long format statement.

/* 101 */ "illegal unit number"

It is illegal to close logical unit 0.
Negative unit numbers are not allowed.
The upper limit is system dependent.

/* 102 */ "formatted io not allowed"

The logical unit was opened for
unformatted I/O.

/* 103 */ "unformatted io not allowed"

The logical unit was opened for
formatted I/O.

/* 104 */ "direct io not allowed"

The logical unit was opened for sequential
access, or the logical record length was
specified as 0.

/* 105 */ "sequential io not allowed"

The logical unit was opened for direct
access I/O.

/* 106 */ "can't backspace file"

The file associated with the logical unit
can’t seek. May be a device or a pipe.

/* 107 */ "off beginning of record"

The format specified a left tab beyond the
beginning of an internal input record.

/* 108 */ "can't stat file"

The system can’t return status information
about the file. Perhaps the directory is
unreadable.

/* 109 */ "no * after repeat count"

Repeat counts in list-directed I/O must be
followed by an * with no blank spaces.

/* 110 */ "off end of record"

A formatted write tried to go beyond the
logical end-of-record. An unformatted read
or write will also cause this.

/* 111 */ "truncation failed"

The truncation of an external sequential file on
’close’, ’backspace’, ’rewind’ or ’endfile’ failed.

/* 112 */ "incomprehensible list input"

List input has to be just right.

/* 113 */ "out of free space"

The library dynamically creates buffers for
internal use. You ran out of memory for this.
Your program is too big!

/* 114 */ "unit not connected"

The logical unit was not open.

/* 115 */ "read unexpected character"

Certain format conversions can’t tolerate
non-numeric data. Logical data must be
T or F.

/* 116 */ "blank logical input field"

/* 117 */ "'new' file exists"

You tried to open an existing file with
"status='new'".

/* 118 */ "can't find 'old' file"

You tried to open a non-existent file
with "status='old'".

/* 119 */ "unknown system error"

Shouldn’t happen, but ...

/* 120 */ "requires seek ability"

Direct access requires seek ability.
Sequential unformatted I/O requires seek
ability on the file due to the special
data structure required. Tabbing left
also requires seek ability.

/* 121 */ "illegal argument"

Certain arguments to ’open’, etc. will be
checked for legitimacy. Often only
non-default forms are looked for.
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/* 122 */ "negative repeat count" 


The repeat count for list directed input 
must be a positive integer. 


/* 123 */ "illegal operation for unit" 

An operation that is not possible was requested
for the device associated with the logical
unit. The tape I/O routines return this error
on an attempt to read past end-of-tape, etc.
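
Because each message is identified by a number, a program can map an error number back to
its text with a simple table. The following C fragment is a minimal sketch of such a lookup,
covering only the numbers 105 through 123 listed above. The names F77_FIRST_ERR, f77_errlist
and f77_errmsg are hypothetical and are not taken from the library source.

```c
#include <stddef.h>

/* Hypothetical names: this table is an illustration, not the library's own. */
#define F77_FIRST_ERR 105   /* first error number covered by this sketch */

static const char *const f77_errlist[] = {
    /* 105 */ "sequential io not allowed",
    /* 106 */ "can't backspace file",
    /* 107 */ "off beginning of record",
    /* 108 */ "can't stat file",
    /* 109 */ "no * after repeat count",
    /* 110 */ "off end of record",
    /* 111 */ "truncation failed",
    /* 112 */ "incomprehensible list input",
    /* 113 */ "out of free space",
    /* 114 */ "unit not connected",
    /* 115 */ "read unexpected character",
    /* 116 */ "blank logical input field",
    /* 117 */ "'new' file exists",
    /* 118 */ "can't find 'old' file",
    /* 119 */ "unknown system error",
    /* 120 */ "requires seek ability",
    /* 121 */ "illegal argument",
    /* 122 */ "negative repeat count",
    /* 123 */ "illegal operation for unit",
};

#define F77_NERR (sizeof f77_errlist / sizeof f77_errlist[0])

/* Return the message for an error number, or NULL if it is outside the table. */
const char *f77_errmsg(int errnum)
{
    if (errnum < F77_FIRST_ERR || (size_t)(errnum - F77_FIRST_ERR) >= F77_NERR)
        return NULL;
    return f77_errlist[errnum - F77_FIRST_ERR];
}
```

With this table, f77_errmsg(113) yields "out of free space", and a number outside the range
105 through 123 yields NULL.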

Appendix B

Exceptions to the ANSI Standard 

A few exceptions to the ANSI standard remain. 

1) Vertical format control 

The “+” carriage control specifier is not implemented. It would be difficult to implement
it correctly and still provide UNIX-like file I/O.

Furthermore, the carriage control implementation is asymmetrical. A file written with
carriage control interpretation cannot be read back with the same characters in column 1.

An alternative to interpreting carriage control internally is to run the output file through
a “FORTRAN output filter” before printing. Such a filter could recognize a much broader range
of carriage control characters and include terminal-dependent processing.
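
As an illustration of the filter idea, the following C fragment is a minimal sketch and not
an existing utility. It assumes the conventional column 1 characters (blank for single
spacing, 0 for double spacing, 1 for a new page, + for overprinting), treats any other
character as a blank, and assumes records shorter than its buffer. The function names
cc_filter_line and cc_filter and the choice of output characters are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Translate one record whose first character is a carriage control
 * character into printer-ready text in out. The line advance is emitted
 * before the text, so the caller supplies the final newline.
 * first is nonzero for the first record of the file.
 * Returns the number of characters written, or -1 if out is too small.
 */
int cc_filter_line(const char *rec, int first, char *out, size_t outsz)
{
    const char *prefix;
    char cc = rec[0] ? rec[0] : ' ';            /* empty record: blank line */
    const char *text = rec[0] ? rec + 1 : rec;  /* text after column 1 */
    int n;

    switch (cc) {
    case '0': prefix = first ? "\n" : "\n\n"; break;  /* double space */
    case '1': prefix = first ? "\f" : "\n\f"; break;  /* new page */
    case '+': prefix = first ? "" : "\r"; break;      /* overprint */
    default:  prefix = first ? "" : "\n"; break;      /* single space */
    }
    n = snprintf(out, outsz, "%s%s", prefix, text);
    return (n < 0 || (size_t)n >= outsz) ? -1 : n;
}

/* Copy in to out, interpreting column 1 of every record. */
void cc_filter(FILE *in, FILE *out)
{
    char rec[512], buf[520];
    int first = 1;

    while (fgets(rec, sizeof rec, in) != NULL) {
        rec[strcspn(rec, "\n")] = '\0';          /* strip the newline */
        if (cc_filter_line(rec, first, buf, sizeof buf) >= 0)
            fputs(buf, out);
        first = 0;
    }
    if (!first)
        fputc('\n', out);                        /* terminate the last line */
}
```

A program would call cc_filter(stdin, stdout) and be placed in a pipeline between the FORTRAN
program and the printer. Emitting the line advance before each record, rather than after it,
is what lets “+” return to the start of the line that was just printed.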

2) Default files 

Files created by default use of rewind or endfile statements are opened for sequential
formatted access. There is no way to redefine such a file to allow direct or unformatted
access.

3) Lower case strings 

It is not clear whether the ANSI standard requires internally generated strings to be upper
case. As currently written, the inquire statement returns lower case strings for any
alphanumeric data.

4) Exponent representation on Ew.dEe output 

If the field width for the exponent is too small, the standard allows dropping the
exponent character, but only if the exponent is > 99. This system does not enforce that
restriction. Further, the standard implies that the entire field, ‘w’, should be filled with
asterisks if the exponent cannot be displayed. In that case this system fills only the
exponent field, since that is more diagnostic.
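
The asterisk behavior described above can be pictured with a small C fragment. This is a
sketch under stated assumptions, not the library's code: it assumes the exponent part of an
Ew.dEe field is the letter E, a sign, and e digits, and that all e + 2 positions become
asterisks on overflow. The helper name exp_field is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: write the exponent part of an Ew.dEe field
 * into out, which must hold at least e + 3 bytes.
 */
void exp_field(int expo, int e, char *out)
{
    char digits[32];
    int len = snprintf(digits, sizeof digits, "%d", abs(expo));

    if (len > e) {
        /* Assumed layout: E, sign and digits are all replaced by '*'. */
        memset(out, '*', (size_t)e + 2);
        out[e + 2] = '\0';
        return;
    }
    /* Exponent fits: E, sign, then e digits padded with leading zeros. */
    sprintf(out, "E%c%0*d", expo < 0 ? '-' : '+', e, abs(expo));
}
```

Under these assumptions, exp_field(5, 2, buf) produces "E+05" and exp_field(123, 2, buf)
produces "****", leaving the mantissa digits of the field readable.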